[57] ABSTRACT

A method for processing heterogeneous data including high 
level specifications to drive program generation of informa- 
tion mediators, inclusion of structured file formats (also 
referred to as data interface languages) in a uniform manner 
with heterogeneous database schema, development of a 
uniform data description language across a wide range of 
data schemas and structured formats, and use of annotations 
to separate out from such specifications the heterogeneity 
and differences that heretofore have led to costly special 
purpose interfaces with emphasis on self-description of 
information mediators and other software modules. 

18 Claims, 5 Drawing Sheets 
INTEGRATION PLATFORM FOR
HETEROGENEOUS DATABASES 

This application claims priority of Provisional U.S. Pat.
Application No. 60/030,215, filed Nov. 5, 1996. The subject
matter of this application is fully incorporated herein.

This invention was partially funded by the U.S. Government
and the U.S. Government has certain rights to the
invention. 

The present invention is a method for processing heterogeneous
data, and more particularly a method which uses an
interoperability assistant module with specifications for
transforming the data into a common intermediate representation
and which creates an information bridge with the
interoperability assistant module through a process of program
generation.

Currently, databases used for design and engineering 
employ a variety of different data models, interface 
languages, naming conventions, data semantics, schemas, 
and data representations. Thus a fundamental problem for 
concurrent engineering is the sharing of heterogeneous 
information among a variety of design resources. Successful 
concurrent engineering also requires access to data from 
multiple stages of the design life-cycle, but the diversity 
among data from different tools and at different stages 
creates serious barriers. 

Although there have been several efforts directed at 
heterogeneous databases, a significant need continues to 
exist for design, engineering, and manufacturing applica- 
tions to be able to easily access and import heterogeneous 
data. Efforts to develop global query languages do not 
address the large group of users who want to see the world 
of external data as if it were an extension of their existing 
system and its specialized representation. These users may 
not wish to learn a different global representation — and 
more importantly, their expensive design tools can only 
work with their one specialized representation. 

Database gateways, Common Object Request Broker 
(CORBA), and Open Database Connectivity (ODBC) inter- 
faces which purport to address heterogeneity only do so at 
a relatively superficial level. Database gateways provide 
communication mechanisms to external systems — but only 
if those systems have a relational interface. That is, the 
system must either process SQL queries or at least must
provide a relational application programming interface 
(API). Since an API is a set of functions to be accessed via 
remote or local procedure invocation, a relational API pre- 
sents data in terms of named arrays (relation tables) that 
contain flat tuples, with one data value per column in each 
tuple, and with the number of columns fixed per named 
array. The functions' names and parameters are defined in 
the API and may vary depending upon the gateway. Appli- 
cation programmers must write their programs to invoke the 
functions of a particular API. The proposal for an ODBC 
standard offers the potential for a single agreed upon naming 
of the functions in the API. In addition, the CORBA 
approach allows more than a single API "standard" to
coexist, as it can locate an appropriate interface from a 
library of interfaces. 

In all these cases, the programmer still has to write the 
application code to invoke the several functions defined in 
the interface. Data transformations and reformatting may be 
needed on the source side of the API, the target side of the 
API, or often on both sides. All of this is left to the 
programmers, who must implement this on a case by case 
basis. Unfortunately, there is little or no pre-existing software
from which these translators can be built, and each
effort usually begins from scratch. While some vendors offer 
import translators from a few common formats, in general 
these do not provide sufficient interoperability. 

Moreover, if the target use of the data expects non-relational
data (e.g., linked, nested, or other format) then
additional data transformation will be needed, and normally
this too can involve a significant programming effort. Even
within the relational data model, there usually are several
ways of designing the relational tables, in terms of which
attributes are in which tables; that is, there usually is more
than one way to normalize the data. If the application needs
to see the data differently than the API provides, then data
transformations are needed. Thus, it is often necessary for an
organization to write specialized translators for their particular
requirements, and each effort usually begins from
scratch. While some vendors offer import translators from a
few common formats, in general these do not provide
sufficient interoperability.

Previous related work in the database field can be
grouped roughly into three areas: access to heterogeneous 
databases (HDBs), schema integration, and object encapsu- 
lation of existing systems. 

Access to HDBs: Language features for multidatabase
interoperability include variables which range over both data
and metadata, including relation and database names, and
expanded view definitions with provisions for updatability.
Selected translation or mapping approaches are known, and
others which are based upon the 5-level architecture are
described in Sheth and Larson. Approaches to semantics-based
integration and access to heterogeneous DBs, where
semantic features are used to interrelate disparate data and
resolve potentially different meanings, are known. Seen from
a higher level, what are needed are mediation services
among different representations and systems.

Schema Integration: Database design tools have been
applied to schema integration, and related work formalizes
interdatabase dependencies and schema merging. Some
related approaches utilize view-mapping and data migration.
Some of the work in this area assumes that the resulting
target schema or view is to be relational rather than another
heterogeneous representation. Much of the work on data
semantics also is applicable to semantics of schema
integration.

Object Encapsulation: Since object technology can hide
the implementation and data structures within objects, it is
interesting to consider encapsulation and the hiding of
heterogeneity within object interfaces, with one or more
interfaces being specially crafted for each heterogeneous
database. Some of the work in this area includes the FIND IT
system [6], object program language interfaces, and an
execution environment for heterogeneous software interfaces.
These object approaches serve to hide, rather than
obviate, the specialized programming which still is needed
for each application. Sciore has done interesting work on the
use of annotations to support versioning, constraint
checking, defaults, and triggers.

SUMMARY OF THE INVENTION 

One aspect of the present invention is drawn to a method
for integrating heterogeneous data embodied in computer
readable media having source data and target data including
providing an interoperability assistant module with specifications
for transforming the source data, transforming the
source data into a common intermediate representation of
the data using the specifications, transforming the intermediate
representation of the data into a specialized target
representation using the specifications. An information
bridge is created with the interoperability assistant module
through a process of program generation and the source data
is processed through the information bridge to provide target
data wherein the target data is in a non-relational form with
respect to the source data.

The DAtabase Integration SYstem (DAISy) of the present
invention provides both high-level user interfaces and program
level access across heterogeneous databases (HDBs),
allowing integration of a wide variety of information
resources including relational and object databases, CAD
design tools, simulation packages, data analysis and visualization
tools, and other software modules. DAISy eliminates
tedious and costly specialized translators. A key focus is on
reusability of components through a specification-based
approach. A declarative specification language is utilized to
represent the source and target data representations. This
high level data structure specification (HLDSS) provides a
uniform language for representing diverse databases and
specialized file formats, such as produced by CAD tools. In
addition, a rule-like specification applies functional transformations
in a manner similar to that of production rule
systems and term-rewriting systems. These specifications
are utilized to create information mediators and information
bridges each of which can access heterogeneous data
resources and transform that information for use by databases
and specialized representations. In all cases, the data
is mapped to an intermediate internal format. This also
provides reusability of each information mediator to support
multiple applications. Each of the information bridges are
created through a program generation process, and this
process and the associated tools within the DAISy platform
are reused for each new information bridge which is created.

BRIEF DESCRIPTION OF THE DRAWINGS

Other features of the present invention will become
apparent as the following description proceeds and upon
reference to the drawings, in which:

FIG. 1 shows the architecture of the database integration
platform;

FIG. 2 shows details of the interoperability assistant of the
system;

FIG. 3 shows a logical structure diagram;

FIG. 4 shows a dependency graph; and

FIG. 5 shows a schema browser display.

DETAILED DESCRIPTION OF THE INVENTION

While the present invention will be described in connection
with a preferred embodiment thereof, it will be understood
that it is not intended to limit the invention to that
embodiment. On the contrary, it is intended to cover all
alternatives, modifications, and equivalents as may be
included within the spirit and scope of the invention as
defined by the appended claims.

The types of heterogeneity which are addressed by the
present invention include different schemata, different data
models, differences in physical storage and access issues,
and differences in semantics and meaning of the data. Such
types of structural and semantic heterogeneity have been a
primary impediment to effective data interoperability across
disparate organizations.

An expeditious approach to address heterogeneity is to
map specialized and idiosyncratic terminology into an
agreed upon standard for description and representation of
application data. With this approach, the description of a
specialized information resource would be given in terms of
the agreed upon standard model.

When a standardized model or other neutral model is
selected, then for N client sites or databases, one needs 2N
transformations, i.e., N bi-directional mappings between
each client information resource and the central model.
Creation of such standardized application models is the goal
of various standards efforts, including the PDES/STEP
effort; DAISy can interface with such standardized models.
When a standardized model is not available, other
approaches may be pursued.

Even when there is no standard model, semantics may be
captured in terms of islands of agreement, that is, localized
semantic models. Each local model would be self-consistent
and utilize a common vocabulary (sometimes this is referred
to as an ontology). However, a term in different models or
islands may have different meanings. A collection of such
islands of agreement may address a substantial portion of the
semantics of a large application.

As an example, in the United States the use of feet,
inches, and miles for length forms one semantic island of
agreement, while the use of pounds, ounces, and tons for
weight forms another island; these two semantic islands
often occur together, but one deals with length while the
other deals with weight. Analogous but different semantic
islands occur in Europe where meters and centimeters are
used for length, and kilograms and grams are used for
weight. Even the meaning of the weight 'pound' also
depends on the semantic island, as there is both the troy
system pound (which is 0.373 kg) and the avoirdupois pound
(which is 0.454 kg). The word 'pound' as a unit of measure
also occurs in another semantic island for monetary
currency, as in the British pound. Conversion between
different islands also can be time dependent, as in
the conversion of dollars to yen.

Another example of an island of agreement is the topological
map, which takes on a precise meaning in conjunction
with a legend that specifies which visual map representations
are associated with which topological features.
The legend appeals to a commonly agreed upon model of
topological maps, and in so doing it makes precise the
semantics of the map by specifying the choice of features
and representations to be used in that map. In some
sense, the legend helps to make a map self-describing.

Many actual applications are not fortunate enough to have
either an agreed upon standard model, nor to be describable
by an adequate set of localized models which are commonly
accepted. Of course, one can create a semantic model for an
application of interest, and such may be desirable. But when
creating information mediators and data transformations, it
is not desirable to preclude applications where neither the
time nor resources are available to first create such specialized
semantic models. Thus the present invention has been
developed which not only encompasses standard application
models and local models when they exist, but which also
supports interoperability in the absence of such models.

The key to accomplishing these goals across a wide
variety of applications is the use of high level descriptions
and specifications, for both the given application's representation
of its data, as well as for specification of the
executable transformations. The information bridge methodology
also supports direct transformations in the absence
of semantic models. The actual transformation process itself
is similar whether the target is an agreed upon standard
model or just a specialized representation for a particular
application. What differs when an adequate semantic model
is not available is the extra effort to specify all the transformations
explicitly and to ensure that they are correct and
complete.

Developing generalizable technology which can be
applied to a wide variety of situations is highly desirable.
This is one of the reasons high level descriptions and
specifications are used in the present invention.

Since as high a level of input description as possible is
desired, with only essential details, much analysis of these
specifications is needed to create and operationalize an
information mediator. A simple interpreter of these specifications
would not have had enough information to proceed
without such analysis. Thus a second motivation for the
program generation approach was to find a convenient way
of capturing the results of the analysis, especially where
tailored code needed to be generated.

The high level specifications are used to drive application
generators which tailor and/or create the necessary
transformations, programs for data access, and software
interfaces. The result is generation of a specific information
mediator or information bridge for use between disparate
information resources. Such mediation and transformation
may utilize an intermediate common model or may be a
direct mapping, depending upon the specification.

In this manner, a variety of kinds of information resources
utilizing a common specification language and internal
representation can be supported. The design can support
heterogeneous databases with different schemas and data
models, such as multiple relational and object-oriented
databases, and other application packages, such as CAD,
CAE, and CASE (Computer Aided Design, Computer Aided
Engineering, and Computer Aided Software Engineering)
tools, which produce specialized data structure representations.

The overall architecture of the system is reviewed in FIG.
1 and consists of several primary modules. The information
bridge 1 transforms data from heterogeneous data resources
2, 4 and 6, for example, respectively a Nano-Fabrication
Database, Simulation Tools, and a CAD/CAM (Computer
Aided Design/Computer Aided Manufacturing) Database
into a common intermediate representation and then into a
specialized target representation, as determined by the
specifications. The target representation may be another
database with a different data model and schema, or the
target may be a specialized data structure needed by a design
tool.

A system user 8 may access the information bridge 1 in a
variety of ways. One way of accessing uses an optional
browser 10, shown as a graphical interface to view and
browse the combined uniform schema 11 and data obtained
from multiple data sources 2, 4, and 6. Queries also may be
posed against a common view of the data collection in order
to focus the resulting view for additional browsing. Another
way of accessing the information bridge is through existing
tools 12 and the various data representations 2, 4 and 6 those
tools understand. In this case, the information bridge completes
the transformation process and imports the external
data into this specialized representation.

The information bridge 1 is created by the interoperability
assistant (IA) module 20 through a process of program
generation. The specific transformations which are compiled
into the information bridge are derived from the specifications
which an integration administrator 14 provides, with
this process being facilitated by some of the tools provided
by the IA 20 subsystem.

Self-description is an important organizing principle for
highly distributed very large heterogeneous information
networks. It enables the automatic self-registration of software
modules and software agents simply by their being
presented to a registry service, shown as schema repository
16. This registry then need only query the module via
command line arguments or via an applications programming
interface (API) to obtain sufficient registration and
[illegible] requirements information. The application generation
stages have been implemented to automatically generate
such self-description information for the created application
code and generated executable modules.

The self-description information can aid the interoperability
assistant (IA) 20 to decide if and when the bridge
needs to be rebuilt, or what sub-parts need to be rebuilt.
Self-description information also describes the function and
purpose of the information mediator. Because all of this
information is self-contained, there is no need to search for
configuration and source files, attempt to match up version
numbers, or wonder which readme files go with which
modules.

The self-description includes the HLDSS and HLTRS
high level specifications, with a timestamp of when they
were created, the full file names of the primary intermediate
files and modules utilized during creation of this information
mediator bridge, the time of creation of the bridge, and
additional information. Each of these self-description
attributes can be queried separately or as a group, the latter
being shown. Since some of these intermediate files were
themselves created by the IA's application generation
process, each such file also contains in its header a description
of the HLDSS or HLTRS which gave rise to its
production. One of the several benefits of such self-description
is that it simplifies the management of generated
source code and the resulting compiled modules. This can be
especially useful in a large system.

Once a set of basic self-description attributes, together
with an extensible set of additional attributes are agreed
upon, the module can list its full set of self-description
attributes as one of the pieces of information it provides. So
long as the meaning of the extended attributes can be
determined by the registry, a substantial amount of operational
and semantic information can be conveyed automatically.

There are several approaches to ensuring that a registry is
able to understand and use the extended attributes. The
simplest is that the full set be standardized, but only the basic
set are required. A second approach is for the module to
provide a description of each extended attribute in some
agreed upon simple description language. The third
approach is for the registry to consult with another registry
for a description of the additional features. One of the
software module's self-description attributes would be the
identifier of the registry where these extended attributes are
defined. In this manner, software modules can be imbued
with the intelligence needed to make their capabilities and
assumptions known.

A new software package then can be registered automatically
once the existence of the package is known. In this
case, the server invokes the package with the standard
argument requesting its self-description, and the server then
enters that information into its database. Later if the server
is queried as to the existence of a software package that
provides some set of functions, the server can respond by
checking the attributes and values of registered packages,
and return information on all that match the query. Other
information can be provided in the self-description, such as
the units of measurement that are used, semantic details, 
general information, and perhaps cross-referencing of other 
documentation (e.g., manual pages or documents that 
describe some standard implemented by the software). 

Because the self-description information is self-contained
within the package itself, it increases the usefulness of the 
package by not requiring information about it to be scattered 
around the system in the form of documentation files, source 
code comments, configuration files, and post-it notes on the 
screen of the last person that compiled the package. There is 
a great deal of potential information that can be stored in the 
self-description, and while there are few servers that utilize 
that information currently, the system has been designed to 
provide such information and other metadata for the next 
generation of intelligent software agents. 
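
The self-registration exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DAISy implementation: the `Module` and `Registry` classes, the `describe()` method (standing in for invoking a package with the standard self-description argument), and the attribute names `function` and `units` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """A software module that carries its own self-description (hypothetical)."""
    name: str
    attributes: dict = field(default_factory=dict)

    def describe(self) -> dict:
        # Stands in for invoking the package with the standard
        # argument that requests its self-description.
        return {"name": self.name, **self.attributes}


class Registry:
    """Toy registry that stores self-descriptions and answers queries."""

    def __init__(self):
        self._entries = []

    def register(self, module: Module) -> None:
        # The registry only needs to know that the module exists;
        # everything else comes from the module's own description.
        self._entries.append(module.describe())

    def find(self, **wanted) -> list:
        # Names of all registered modules whose attributes match
        # every requested attribute/value pair.
        return [
            entry["name"]
            for entry in self._entries
            if all(entry.get(k) == v for k, v in wanted.items())
        ]


registry = Registry()
registry.register(Module("cad_bridge", {"function": "import", "units": "mm"}))
registry.register(Module("sim_bridge", {"function": "export", "units": "mm"}))
matches = registry.find(function="import", units="mm")
```

Because the query runs only against what each module reported about itself, adding a new attribute requires no change to the registry code, which is the point of keeping the description inside the package.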

The IA 20 consists of several internal modules as shown
in FIG. 2. The input to the IA consists of two high level data
structure specifications (HLDSS) and a high level transformation
rule specification (HLTRS). These are created with
the help of other tools associated with the IA subsystem. The
HLDSS data representation is reusable for all applications 
which need to access this data resource, for either input or 
output and may be edited as desired. 

The two HLDSS data structure specifications describe the 
source 22 and target 24 representations respectively. Both 
the source 22 and target 24 may be very different heteroge- 
neous data resources. Each may be an object, relational, or 
other database, or a specialized data file and structure, or an 
application program interface (API) to a software package, 
such as a CAD tool. 

The structure within the IA subsystem consists of several 
intermediate stages and modules. Most of the code in the 
resulting information mediator is created through generation 
module 30 which is completely driven by the high level 
HLDSS and HLTRS specifications. The output of the IA is 
a compiled information bridge mediator 60, which when 
executed, provides the desired transformations between het- 
erogeneous representations. 

The execution of the IA 20 module is distinct from the 
execution of an information bridge 1. The IA module is 
active in a system generation phase and generates code
which compiles into an information bridge. In other words, 
an information bridge is the end result of the system gen- 
eration phase; data resources are not involved during this 
phase. After an information bridge has been generated, then 
the bridge can be executed (the information bridge execution 
phase) at which time the data transformation occurs. 

From another perspective, if the same data transformation 
process is required more than once, on different data 
instances, then the IA analysis and code generation is only 
executed once (when the information bridge is built) and the 
information bridge is executed for each data transformation. 
The integration administrator would be the individual most 
likely to be involved with system generation whereas more 
general users would be involved with just the execution 
phase of the information bridges. 
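
The separation between the generation phase and the execution phase can be illustrated with a closure. This is a simplified sketch, assuming that specifications are plain Python lists and functions; `generate_bridge`, the field names, and the centimeter-to-millimeter rule are hypothetical and are not taken from the HLDSS or HLTRS languages.

```python
def generate_bridge(source_fields, target_fields, rules):
    """Generation phase: analyze the specifications once and return an
    executable bridge. No data instances are involved at this point."""
    # The "analysis": resolve in advance which rule yields each target field.
    plan = [(target, rules[target]) for target in target_fields]
    needed = set(source_fields)

    def bridge(record):
        """Execution phase: transform one source record into target form."""
        missing = needed - set(record)
        if missing:
            raise ValueError(f"record lacks source fields: {sorted(missing)}")
        return {target: rule(record) for target, rule in plan}

    return bridge


# Built once, e.g. by the integration administrator.
bridge = generate_bridge(
    source_fields=["width_cm", "part"],
    target_fields=["width_mm", "label"],
    rules={
        "width_mm": lambda r: r["width_cm"] * 10,
        "label": lambda r: r["part"].upper(),
    },
)

# Executed many times, once per data instance.
results = [
    bridge(record)
    for record in (
        {"width_cm": 2, "part": "gear"},
        {"width_cm": 10, "part": "shaft"},
    )
]
```

The analysis cost is paid once inside `generate_bridge`; each later call to `bridge` only applies the precomputed plan.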

The input that the IA module uses to generate the code for 
an information bridge comes from the three high-level 
specification files that were described above. As discussed 
earlier, the high level data structure specification (HLDSS) 
describes the logical and physical representation of a set of 
data resources. The production rules of the HLDSS allow the 
IA module to generate an internal intermediate logical 
representation of the data while the annotations on the 
production rules (and the attributes of these annotations) 
describe the physical representation of the data. 
The IA module requires one HLDSS to describe the
heterogeneous input data sources 22 and another HLDSS to
describe the target representation 24. The third input to the
IA module is the high level transformation rule specification
(HLTRS) 23. The HLTRS 23 is a set of rewrite rules on the
logical representation of the data so that the source data
representation can be transformed (possibly through intermediate
representations) to the target data representation.
A number of activities occur during system generation
processing in the IA generation module 30. First, the source
22 and target 24 HLDSS must both undergo schema analysis.
Schema analyzers, source schema analyzer 32 and target
schema analyzer 52, parse the HLDSS and create logical
structure diagrams (LSDs), source LSD 34 and target LSD
54. The LSD is an internal context-independent representation
of the data resource schemata and structures. The actual
data instances (which are only present during bridge execution
and not during bridge generation) populate an instance
tree which is, for the most part, isomorphic to the schema
tree.
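
The relationship between a schema tree and the instance tree that mirrors it can be sketched as follows. The class names `SchemaNode` and `InstanceNode`, the `populate` helper, and the sample `Part` schema are hypothetical; the actual LSD structures are not specified in this text. The sketch also ignores repeated components, which is one reason the two trees would be isomorphic only "for the most part".

```python
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    """One node of a schema tree; known at bridge generation time."""
    name: str
    children: list = field(default_factory=list)


@dataclass
class InstanceNode:
    """One node of an instance tree; created only at bridge execution time."""
    schema: SchemaNode
    value: object = None
    children: list = field(default_factory=list)


def populate(schema, data):
    """Build an instance tree that mirrors the given schema tree."""
    if not schema.children:
        return InstanceNode(schema, value=data)  # leaf: hold the data value
    node = InstanceNode(schema)
    for child in schema.children:
        node.children.append(populate(child, data[child.name]))
    return node


def shape(node):
    """Return the nested (name, children) outline of either kind of tree."""
    name = node.name if isinstance(node, SchemaNode) else node.schema.name
    return (name, [shape(child) for child in node.children])


schema = SchemaNode(
    "Part",
    [SchemaNode("id"), SchemaNode("size", [SchemaNode("w"), SchemaNode("h")])],
)
instance = populate(schema, {"id": 7, "size": {"w": 2, "h": 3}})
```

Comparing `shape(schema)` with `shape(instance)` shows that the two trees have the same outline, while only the instance tree holds values.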

The schema analysis of the HLDSS specification language
may be facilitated through the use of Lex and Yacc
tools and is identical for both the source and target HLDSS.
However, this does not account for the processing of the
annotations and attributes of the HLDSS; this additional
processing of the HLDSS files is different for the source 22
and target 24 and is performed by the recognizer generator 36
for the source HLDSS and the view generator 56 for the target
HLDSS. The annotations/attributes are processed in both cases to
obtain information such as database and API interfaces,
hosts and filenames. The data handling information is stored
on the individual nodes of the schema trees and code is
generated to access input data and produce output data. The
stored information differs between the source and target, so
as to distinguish between input schema trees and output
schema trees. However, the two generators are logically
similar and may be combined.

If the input HLDSS 22 refers to structured files, such as
design files, for specially formatted data, then the recognizer
generator 36 also produces yacc-like code that the parsing
tool (Antlr/Pccts) can process. Antlr is an acronym for
ANother Tool for Language Recognition, which is part of
Pccts (the Purdue Compiler Construction Tool Set). This
tool generates the actual recognition/parsing module that is
incorporated into the information bridge in a particular
embodiment of the invention.

The above activities on HLDSS files are handled by the
HLDSS parser. The HLDSS parser is called once to generate
the code specific to the source HLDSS 22 and called (with
different parameters) a second time to process the target
HLDSS 24.

The HLTRS specification is sent to the transformer generator
42 which generates code files that describe the transformation
rules in a form usable by the dependency graph 46
(see FIG. 4) which controls the data flow from source to
target.
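
One way a dependency graph can control data flow is by ordering rule execution so that every rule runs after the values it depends on are available. The sketch below uses a topological sort from the Python standard library (`graphlib`, Python 3.9 or later); this ordering technique is an assumption made for illustration, since the text does not state how the dependency graph 46 schedules transformations. The rule names and the `run` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: target name -> (names it depends on, function).
rules = {
    "area": (("width", "height"), lambda w, h: w * h),
    "volume": (("area", "depth"), lambda a, d: a * d),
}


def run(rules, source):
    """Evaluate rules in dependency order, starting from the source values."""
    # Map each derived value to the set of values it depends on.
    graph = {out: set(inputs) for out, (inputs, _) in rules.items()}
    values = dict(source)
    for name in TopologicalSorter(graph).static_order():
        if name in rules:  # source values have no rule and are used as given
            inputs, fn = rules[name]
            values[name] = fn(*(values[i] for i in inputs))
    return values


values = run(rules, {"width": 2, "height": 3, "depth": 4})
```

Here `volume` is computed only after `area`, even though the rules were not written in any particular order.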

The system generation process next compiles the automatically
generated code as well as some fixed template
code which cannot be compiled before the specification code
is created. Finally, the code is linked with the class library
44 and any other libraries that may be necessary to actually
access data (e.g., the Oracle library, etc.). The output of
generation module 30 is a mediator generator which forms
the information bridge mediator 60 with mediator generator
outputs 40, 50 and 58. Input data 62 is processed by the
source recognizer 64 formed by recognizer generator 36, the
transformer engine 66 formed by transformer generator 42
and the target builder 68 formed by view generator 56 to
produce transformed target data 70.

Turning now to a more detailed description of the IA
shown in FIG. 2, the high level data structure specification
(HLDSS) consists of three parts and is used to describe the
logical and physical structure of the source and target
databases, data structures, file formats, and associated information.
Separate HLDSS's are used for the different source
22 (input) and target 24 (output) data. If a data representation
is used for input in some cases and output in other cases,
then essentially the same HLDSS can be used for both.

The HLDSS specification language is uniform regardless
of the diverse database, data structure, and file representations.
The significant differences among these different data
representations are accounted for by annotations on the
HLDSS productions (statements). These annotations give
rise to different specialized interpretations of each
production, based upon the annotation. The uniformity of
the HLDSS language foreshadows the uniformity of the
internal intermediate representation of the actual data within
the information mediator/bridge 60.

The three parts of the HLDSS are the grammar produc-
tions (the primary part) and the type and annotation descrip-
tions.

Grammar productions specify the structure of the 
database, file format, or main memory data structures via a 
grammar-like set of extended BNF productions plus anno- 
tations. Each grammar production consists of: 

Header: Component1 Component2 . . . {Annotation1}

Component_i: Component_i1 Component_i2 . . . {Annotation2}

The interpretation of these productions depends upon the 
optional annotations. If the annotation specifies
"RDB:<database.handle>" then the production refers to a
relational database table where the left hand side (LHS) of 
the production, e.g., 'Header', is the table name, and the 
right hand side (RHS) components are the attributes/ 
columns which are to be selected. Each application of this 
production to the input would produce one such tuple from
the table. The annotation may optionally specify a full SQL 
query, and this production will then refer to the resulting 
virtual table. 

If the annotation is "ODB:<database.handle>" then the 
production refers to an object from an object-oriented data-
base. The LHS is the name of the object, and the RHS
components are the attributes or methods whose values are 
to be selected. An explicit OSQL query may be specified 
instead, in which case, the result will be treated as a virtual 
object (LHS) having the component names designated by
the RHS. 
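
As a rough illustration of how one grammar production might be split into its LHS, its RHS components, and its optional annotation, consider the following minimal sketch. The Production class, the parse_production function, and the regular expression are assumptions introduced here for illustration only; the specification does not prescribe a parser.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Production:
    lhs: str                                      # name of the aggregate
    rhs: List[str]                                # component names as written
    annotation: Optional[Tuple[str, str]] = None  # (category, name), e.g. ("RDB", "Syb")


# Hypothetical pattern: "LHS: comp comp ... {Category:Name}" with the
# brace-enclosed annotation being optional.
_PRODUCTION = re.compile(
    r"^\s*(\w+)\s*:\s*(.*?)\s*(?:\{\s*(\w+)\s*:\s*([^}]*?)\s*\})?\s*$"
)


def parse_production(line: str) -> Production:
    match = _PRODUCTION.match(line)
    if match is None:
        raise ValueError(f"not a production: {line!r}")
    lhs, body, category, name = match.groups()
    annotation = (category, name) if category else None
    return Production(lhs, body.split(), annotation)


header = parse_production("header: Name Version Type Fmt {RDB:Syb}")
print(header.lhs, header.rhs, header.annotation)
```

With an "RDB" annotation the LHS would be read as a table name and the RHS as its selected columns; with "ODB" the same parsed structure would be read as an object and its attributes or methods.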

The key unifying idea here is that all databases and data 
representations revolve around several kinds of named 
aggregates of subcomponents. In turn, each aggregate may
participate in one or more higher level aggregates.

The nature of the aggregations may differ from one data 
representation to another. In a unified approach, the different 
kinds of aggregations and data representations are distin- 
guished through the use of annotations. Different kinds of 
aggregations include, for example:

1) A relation in which each record is an ordered tuple or 
aggregation of the attribute fields of that record, 

2) An object which consists of the collection/aggregation 
of its attributes and methods, 

3) A multi-valued attribute/method of an object in which
the collection/aggregation is the set-valued attribute or
the set returned by a method,

4) A structured file in which a record consists of a
sequence or aggregation of fields, some of which may
be named aggregates themselves, and

5) The repeating group and parent-child "set" of
hierarchical models (e.g., IMS) and of Codasyl DBTG
network models, in which the repeating instances or
children are the aggregation associated with the parent
node.

In all these cases, the approach uniformly treats the right 
hand side (RHS) of each production or statement as the 
components of the aggregate, and the name of the aggregate 
is given by the left hand side (LHS) of that production. Each 
instantiation of a production is treated as an instance of that 
kind of data aggregation. So for a relational database, the 
LHS names the relation (real or virtual/derived) and the 
RHS names the attribute fields. Then an instance of this 
production represents a relational data tuple. 

The annotations on grammar productions are of the form
"{<CategoryName>: <name>}", where <name> is just a
reference to a subsequent line beginning with "Spec:
<name>"; this serves to elaborate this annotation with
further details. The CategoryName designates the primary
category of the data source, and is selected from the top level
of the taxonomy of heterogeneous data resources, which
includes Relational DB, Object DB, File, URL, API, and
DataStructure (i.e., main memory data structure).

In the example shown below, the annotation name 'Syb' 
appears on a line of the grammar productions and serves to 
identify the more detailed specifications that follow below, 
on the lines which begin with "Spec".

//Format of HLDSS:
//AggregateStructure: Component1 Component2 . . . {Annotation}
//Type or Spec: data-objects or access-spec
Start: manifest*
manifest: header bound element+(numberOfElements) node+(numberOfNodes)
header: Name Version Type Fmt {RDB:Syb}
bound: numberOfElements numberOfNodes scaleX scaleY {ODB:Ont}
element: elementNumber nodeNumber+(8) {ODB:Ont}
node: nodeNumber xCoordinate yCoordinate {ODB:Ont}
Type String: Name Version Type Fmt
Type Integer: numberOfElements numberOfNodes elementNumber nodeNumber
Type Float: scaleX scaleY xCoordinate yCoordinate
Spec Syb: {DBTYPE:Sybase, Network:@site1.com:42}
Spec Ont: {DBTYPE:Ontos, OSQL:"Where $1.Mesh.name=$Name",
Network:@site2.com}

This specification accesses a Sybase relational table
named 'header' for attributes Name, Version, Type, and Fmt.
It also accesses an ONTOS object database to obtain the
'bound' information from the 'Mesh' object whose Name
was just obtained from the relational tuple. The 'element'
and 'node' data are obtained from the same Mesh object.
'numberOfElements' is the number of instances of 'element'
data that occur for 'element+' and 'numberOfNodes' is the
number of instances for 'node+'. In turn, each 'element' has
an elementNumber and eight nodeNumber data components,
each of which is an integer.
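
The repetition suffixes used in the example (such as 'element+(numberOfElements)' and 'nodeNumber+(8)') can be decoded mechanically. The sketch below is one possible reading, assuming that a component is a name, an optional '*' or '+', and an optional parenthesized count that is either a literal integer or the name of another field holding the count. The function name parse_component and the regular expression are illustrative assumptions, not part of the HLDSS definition.

```python
import re

# Assumed shape of a component: name, optional "*" or "+", optional "(count)".
_COMPONENT = re.compile(r"^(\w+)([*+])?(?:\((\w+)\))?$")


def parse_component(text):
    """Return (name, repeat, count) for one RHS component.

    repeat is "*", "+", or None; count is an int, the name of the field
    that carries the count, or None when no count is given.
    """
    match = _COMPONENT.match(text)
    if match is None:
        raise ValueError(f"unrecognized component: {text!r}")
    name, repeat, count = match.groups()
    if count is not None and count.isdigit():
        count = int(count)  # literal count such as "(8)"
    return name, repeat, count


for component in ["Name", "nodeNumber+(8)", "element+(numberOfElements)"]:
    print(parse_component(component))
```

Under this reading, a recognizer would read exactly eight nodeNumber values per element, and as many element instances as the previously read numberOfElements value indicates.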

On lines which start with "Spec <name>:" there follows
additional information about the annotation. This consists of
a sequence of "Keyword: value" entries, where for certain
keywords the value may be optional. The set of allowed
keywords depends upon the main category or phylum of the
data resource. For example, the annotation "{RDB: Syb}"
applies to the production whose left hand side is "header".
RDB signifies the relational family of database types, and
"Syb" names the annotation. The elaboration of "Spec: Syb"
here identifies the relational DB as being Sybase, with the
server at "@site1.com", and here port 42 is indicated,
though usually the port is not needed.
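
A "Spec" line of this kind can be read as a small keyword/value table. The following sketch assumes the entries are comma-separated, that commas inside double quotes belong to the value, and that keywords are case-insensitive; none of these details is stated in the specification, and parse_spec is a hypothetical helper name.

```python
def parse_spec(line):
    """Parse 'Spec <name>: {Keyword:value, ...}' into (name, dict)."""
    head, _, body = line.partition(":")
    kind, name = head.split()
    if kind != "Spec":
        raise ValueError(f"not a Spec line: {line!r}")
    body = body.strip()
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]

    # Split on commas that are not inside a double-quoted value.
    entries, current, in_quotes = [], [], False
    for ch in body:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == "," and not in_quotes:
            entries.append("".join(current))
            current = []
        else:
            current.append(ch)
    entries.append("".join(current))

    spec = {}
    for entry in entries:
        if not entry.strip():
            continue
        # Only the first colon separates keyword from value, so a value
        # such as "@site1.com:42" keeps its port suffix.
        key, _, value = entry.partition(":")
        spec[key.strip().lower()] = value.strip().strip('"') or None
    return name, spec


print(parse_spec("Spec Syb: {DBTYPE:Sybase, Network:@site1.com:42}"))
```

A keyword given without a value maps to None here, which is one way to model entries whose value is optional.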

Similarly, if the annotation is "ODB:<tag>" then the 
production refers to an object from an object-oriented data- 
base. The LHS is the name of the object, and the RHS 
components are the attributes or methods whose values are 
to be selected. An explicit OSQL query may be specified to 
qualify the objects of interest, and/or to limit the set of 
attributes/methods to be accessed — not all OODBs support 
such queries. 

An annotation applies to all subordinate productions (i.e., 
expansions of the non-terminals on the right hand side of the 
production) unless productions for those non-terminals pro- 
vide different annotations. 
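
The inheritance rule for annotations can be sketched as a top-down walk over the productions. In the sketch below, the dictionary layout, the function name effective_annotations, and the sample production names 'body' and 'record' and annotation name 'F1' are hypothetical. When a non-terminal is reachable from several parents, the first visit wins in this sketch; the specification does not say how that case is resolved.

```python
import re


def effective_annotations(productions, start):
    """Resolve the annotation in force for each production.

    productions maps an LHS name to (list of RHS components, own annotation
    or None). A production without its own annotation inherits the one in
    force for the production that expands to it.
    """
    resolved = {}

    def visit(name, inherited):
        if name not in productions or name in resolved:
            return  # terminal component, or already resolved via another parent
        components, own = productions[name]
        effective = own if own is not None else inherited
        resolved[name] = effective
        for component in components:
            # Drop repetition suffixes such as "*", "+" or "+(8)".
            visit(re.match(r"\w+", component).group(0), effective)

    visit(start, None)
    return resolved


productions = {
    "Start": (["manifest*"], None),
    "manifest": (["header", "body"], ("File", "F1")),
    "header": (["Name", "Version"], ("RDB", "Syb")),  # overrides the inherited one
    "body": (["record+"], None),                      # inherits ("File", "F1")
    "record": (["x", "y"], None),                     # inherits through 'body'
}
print(effective_annotations(productions, "Start"))
```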

The HLDSS provides a specification language using a 
linear textual syntax consisting of a sequence of lines of text.
Corresponding to the HLDSS linear syntax, a graphical 
representation has been developed and is referred to as the 
logical structure diagram (LSD). 

Referring again to FIG. 2, the HLDSS is parsed and
processed by schema analyzers 32 and 52 to create annotated
logical structure diagrams (LSDs) 34 and 54, each of
which is a schematic structure graph (like a parse tree) that
represents the schema and data structures of the data
resources. Logical structure diagrams provide a form of
meta-schema in that LSDs may be used to graphically
describe different data models and schemas. The motivation
for the LSD internal representation is that a context-
independent uniform graph-based representation can repre-
sent all anticipated data resource schemata and structures.
The context-dependent interpretation is dictated by
annotations, which are processed by the HLDSS preprocessor
and the recognizer generator 36 and view generator 56
modules.

Internally, the uniform LSD structure represents aggre-
gations (e.g., sets, hierarchically nested data, etc.) and asso-
ciations (n-ary relationships, attributes, dependencies).
Basically, the LSD consists of nodes and edges, where each
node and edge logically may be associated with both an
interpretation and a label, as described below. This interpre-
tation does not change the uniform LSD structure, but rather
enables us to see how the LSD corresponds or maps to (or
from) the external heterogeneous formalisms and structures.

Thus the HLDSS is the surface syntax and the LSD is the
general schematic conceptual structure which is used inter-
nally. Surface syntax other than the HLDSS could be utilized
to create the schematic LSD initially. The graphical interface
could be extended to support multiple specification
paradigms.

The logical structure diagram provides a uniform repre- 
sentation over diverse schemata and structures, and it does 
this by logically separating the representation problem into 
three layers: 

Syntax and structure of the LSD, which are independent 
of the data model representation. 

Interpretation of the LSD relative to different data model 
formalisms, such as a relation, an attribute, an object, or 
a method which returns a value, etc. Such interpreta- 
tions impact the meaning of the nodes and edges of the 
LSD. 

Labeling of specific LSD components to correspond to
named application schema components, including
reference to attribute names, relationship names, and
object names in the application schema.

These layers of syntax, interpretation, and labeling
enable the use of the same LSD syntax to represent very
diverse data models and application schemas.

The ability to define language constructs that can be
applied to many types of data models and schemas hinges
on the ability to define different interpretations of such
uniform constructs. Then the differences among schemas
and data models can be encapsulated into such interpreta-
tions. Since a given information bridge may access data
from multiple sources and/or produce data with multiple
destinations, the media-specific annotations that are part
of the initial HLDSS specification are also attached to the
LSD for convenience of processing.

The above three representation layers may be imple-
mented in the following forms:

1) The descriptive representation language or formalism
is the logical structure diagram. It is a logical graph
(often a tree) of data entities and logical dependencies.
The fact that this descriptive language is uniform in its
logical structure depends on the ability to indepen-
dently specify the interpretation and naming relative to
different data models and different databases.

2) The choice of data model, as described in annotations,
specifies interpretations such as: whether an association
represents a relational table, or a relationship between
two objects, or a relationship between an aggregate
object and the members of the aggregate. These impor-
tant differences are accounted for by the interpretation
of the LSD constructs.

3) The particular schema: the names of the particular
relations or objects, and/or attributes. These names are
utilized in the components of the LSD.

More precisely, the logical structure diagram (LSD) is
defined as a triple (N, E, I), where N is a set of nodes, E is
a set of edges over such nodes, and I is the interpretation of
the nodes and edges relative to the data model and schema.
Each edge E_i is a tuple consisting of k_i nodes from N and
an optional edge label. The edge represents an association or
relationship, and in general may be a hyperedge (thus
making the LSD a hypergraph). Binary edges, that is, edges
with two nodes plus an optional edge name, also will be used.



The interpretation I=(I_N, I_E) consists of two parts. I_N is an
interpretation over the nodes which partitions the node set N
into subsets, each corresponding to a primary component
type of the chosen data model (e.g., relation table names and
attribute names). I_E is an interpretation which partitions the
edges so that a subset of edges contains a common set of
node(s) from one partition of I_N and this subset of edges
relates these node(s) with nodes from a different partition of
I_N. Thus the interpretation I represents and makes explicit
the primary implicit relationships of the data model.

Thus for example, a relational data model would have an 
interpretation in an LSD where I consists of nodes which 
denote relation names, nodes which designate attribute
names within the relations, and edges would connect rela- 
tion nodes and attribute nodes. Similarly, an object model 
would designate nodes as object types, attributes, or
methods, and edges would designate either: 1) a named 
relationship between an object type and one of its attributes,
2) a subtype (specialization) relationship between an object 
type and its parent object type, or 3) an aggregation rela- 
tionship between an object type and its component object 
types. 
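
The triple (N, E, I) and its relational interpretation can be made concrete with a small data structure. The sketch below is illustrative only: the class name LSD, the helper relational_lsd, the kind strings "relation" and "attribute", and the choice to qualify attribute nodes as "table.attribute" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

# An edge is a tuple of nodes (binary here, but any arity is allowed)
# together with an optional edge label.
Edge = Tuple[Tuple[str, ...], Optional[str]]


@dataclass
class LSD:
    nodes: Set[str]            # N
    edges: List[Edge]          # E
    node_kind: Dict[str, str]  # I_N: assigns each node to one partition

    def node_partition(self) -> Dict[str, Set[str]]:
        """Group the nodes by the component type that I_N assigns to them."""
        groups: Dict[str, Set[str]] = {}
        for node, kind in self.node_kind.items():
            groups.setdefault(kind, set()).add(node)
        return groups

    def edges_cross_partitions(self) -> bool:
        """Check that every edge relates nodes from different partitions."""
        return all(
            len({self.node_kind[node] for node in members}) > 1
            for members, _label in self.edges
        )


def relational_lsd(tables: Dict[str, List[str]]) -> LSD:
    """Build an LSD whose interpretation is that of a relational schema."""
    nodes: Set[str] = set()
    edges: List[Edge] = []
    node_kind: Dict[str, str] = {}
    for table, attributes in tables.items():
        nodes.add(table)
        node_kind[table] = "relation"
        for attribute in attributes:
            attribute_node = f"{table}.{attribute}"
            nodes.add(attribute_node)
            node_kind[attribute_node] = "attribute"
            edges.append(((table, attribute_node), None))
    return LSD(nodes, edges, node_kind)


lsd = relational_lsd({"header": ["Name", "Version", "Type", "Fmt"]})
print(lsd.node_partition())
```

An object model would reuse the same class with different kind strings and with labeled edges for named, subtype, and aggregation relationships.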

An example of a logical structure diagram 200 is shown
in FIG. 3 and corresponds to the following HLDSS produc-
tions:

Start (110): manifest+ (112)

manifest (114): header (120) bound (130)
element+(numberOfElements) (140) node+(numberOfNodes)
(150)

header (120): Name (122) Version (123) Type (124) Fmt
(125) {RDB: Syb}

bound (130): numberOfElements (132) numberOfNodes
(133) scaleX (134) scaleY (135) {ODB: Ont1}

element (142): elementNumber (143) nodeNumber+(8)
(144) {ODB: Ont2}

node (152): nodeNumber (153) xCoordinate (154)
yCoordinate (155) {ODB: Ont3}

The header tuple 120 represents data from a relational 
table while the data associated with the element object 140 
comes from an object database — note that element repre- 
sents a set or collection of element object instances. The 
sources of these and other data instances are described in the
HLDSS specification. 

Thus the logical structure diagram (LSD) is a uniform 
representation of the HLDSS specification and represents 
the schematic structure of the data. With the source and
target annotations, the interpretation of the LSD describes 
the heterogeneous data resources. When looked at without 
the annotations, the LSD provides a uniform tree structured 
description of the intermediate data within the transformer 
of the Information Mediator. A directed cycle would arise
when a data structure refers to itself recursively in its schema 
or definition. Multiple parents would arise when both parent 
nodes refer to a single physical shared substructure. 

Since the intermediate internal structure of actual data 
mirrors the LSD schematic structure, the LSD instance tree 
or LSD parse tree reflects the actual structure of the data 
instances within the information bridge. The LSD schema 
and the LSD instances will be explicitly distinguished from 
each other only when this is not implied by context of the 
discussion. Thus the LSD provides a neutral data description 
language, which may be defined in graph-like terms utilizing 
nodes and edges, or may be defined in production-like terms 
in the HLDSS as just described. 

Pattern matching can serve as a general paradigm for 
database access against object-oriented schemas, other data 
models, as well as for specialized design file structures. The 
approach is motivated by the observation that a query could 
be specified by a subgraph or subset of the database schema 
together with annotations designating the outputs and the
selection predicates on certain components. 

Database patterns can be seen as consisting of three 
aspects: (1) general pattern constructs, (2) data model spe- 
cific constructs, and (3) reference to named components of 
a specific schema. As a result, the database patterns are 
applicable to object-oriented schemas and relational sche- 
mas as well as other data models. Queries may reference 
retrieval methods as well as explicitly stored attributes. 

As a result, heterogeneous databases can be supported 
with capabilities such as: queries against data from different 
schemas, providing a uniform view over multiple databases, 
and providing different views for different users, e.g., with
respect to a view commensurate with their local database.

A database pattern (DB pattern) may be defined in terms 
of selected components from the schema together with 
selection predicates. For an object-oriented database, the DB 
pattern will consist of objects, attributes and/or relationships 
from the schema, together with predicate qualifications to 
restrict the allowed data values and relationships. The 
restrictions may limit the combinations of data object 
instances which can successfully match the database pattern.
Optional linking variables can be utilized to interrelate
several data objects and their attributes.

When the DB pattern is applied against a database of
instances, components in the pattern match or bind to data 
instances such that all conditions in the pattern are satisfied. 
The attribute and object names utilized in the pattern serve 
as implicit variables which are bound to these names. 

The DB pattern may match multiple data instances. Some 
components of the pattern may serve only to restrict possible 
matches of other components that are to be produced in the 
result of the pattern match. Thus components which are to 
appear in the result are designated as output components of 
the pattern. Each different valid combination of instances for 
the output components constitutes a match instance. The DB 
pattern may be thought of as returning a collection of 
bindings for each combination of the output variables — such 
that each combination satisfies the DB pattern qualifications. 

Database patterns provide the equivalents of selection, 
projection, and join. In addition they can provide recursion 
to obtain the closure of a relationship, as well as other 
constructs which are useful in non-relational schemas. The 
form of the patterns is based upon the structure of the
schema, and thus such patterns are applicable not only to 
object oriented models but also to the relational model as 
well as more traditional hierarchical and network schemas. 

A database pattern P is the ordered triple [ND, E, F], where
ND is a set of node descriptors, each composed of a node N_i
and an optional predicate P_i. The collection of individual
nodes N_i constitutes the set N. E is a set of edges defined in
terms of nodes from N, as discussed below. F is the pattern's 
outform, which provides a generalization of the notion of 
projection. F can be utilized to specify structuring and 
organization of the resultant data instances. 

An edge E_i represents a relationship among nodes. In the
general case, an edge is indicated by a tuple (N_1, N_2, . . . ,
θ, . . . , N_{k-1}, N_k) of nodes N_i ∈ N, and thus may be a
hyperedge. θ is an optional edge designator, which may be
an existing edge name, recursion over a given edge
type, or a binary comparison function, such as a join
condition. This general form of database pattern can be
usefully treated as a hypergraph, in which some or all edges
have k>2 nodes. When only binary edges, which consist of
two nodes, are used, the database pattern P forms a graph.

The process of matching the pattern P against the database 
D treats the DB pattern P as a mapping (multi-valued 
function) which is applied to D to produce the match set M. 
This pattern matching process is denoted M=P(D). It pro-
duces a match set M which consists of all distinct outform/
output instances. 
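
The application M=P(D) can be sketched for the simple case of binary edges. In the sketch below, the database is assumed to be a dictionary from node names to lists of instances, each edge designator is assumed to be a binary comparison function, and the outform is assumed to be a function of the bindings. The function name match_pattern and the sample data are hypothetical, and the brute-force enumeration is chosen for clarity, not efficiency.

```python
from itertools import product


def match_pattern(database, node_descriptors, edges, outform):
    """Apply a database pattern [ND, E, F] to a database D and return M.

    node_descriptors: list of (node_name, predicate or None)
    edges:            list of (node_a, node_b, comparison function)
    outform:          function from a binding dict to an output instance
    """
    names = [name for name, _ in node_descriptors]
    # Each node is restricted by its own optional predicate first.
    candidates = [
        [inst for inst in database[name] if pred is None or pred(inst)]
        for name, pred in node_descriptors
    ]
    matches = []
    for combination in product(*candidates):
        binding = dict(zip(names, combination))
        # A combination matches only if every edge condition holds.
        if all(test(binding[a], binding[b]) for a, b, test in edges):
            result = outform(binding)
            if result not in matches:  # M holds distinct output instances
                matches.append(result)
    return matches


database = {
    "element": [{"elementNumber": 1, "nodeNumber": 10},
                {"elementNumber": 2, "nodeNumber": 11}],
    "node": [{"nodeNumber": 10, "x": 0.0},
             {"nodeNumber": 11, "x": 1.5},
             {"nodeNumber": 12, "x": 2.0}],
}
node_descriptors = [("element", None), ("node", lambda n: n["x"] > 0.5)]
edges = [("element", "node", lambda e, n: e["nodeNumber"] == n["nodeNumber"])]
outform = lambda b: (b["element"]["elementNumber"], b["node"]["x"])

print(match_pattern(database, node_descriptors, edges, outform))
```

Here the node predicate plays the role of selection, the edge comparison plays the role of a join condition, and the outform plays the role of projection.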

The basic definition of database patterns is independent of 
the data model and the schema because the distinction is 
made between the following (conceptual) phases in the 
creation and use of database patterns. These are: 

1. Syntax and structure of database patterns, as defined 
above, which are independent of data model. 

2. Interpretations of database patterns relative to different 
data model formalisms: object-oriented, relational, etc. 
This impacts the meaning or interpretation of nodes and 
edges. 

3. Labeling of a specific database pattern relative to a 
given application schema — including reference to 
attribute, relationship, and object names in the appli- 
cation schema. This impacts the naming of the nodes 
and edges of the DB Pattern. 

4. Application of the pattern, including binding of pattern 
variables based on matches, and return of matching 
instances from the database. 

These database patterns provide a graphical representa- 
tion for global queries, and thereby support transparency 
with regard to location and heterogeneity of distributed data.

The high level transformation rule specification (HLTRS)
23 consists of a set of condition-action transformation rules
(sometimes also referred to as tree/graph rewrite rules),
which are of the form:

Triggering-Conditions => Transformation-Function ::
Result-label

The left side of a rule specifies the data objects on which 
the rule operates, the middle section of the rule specifies the 
action or transformation operation to be applied on each 
such input instance, and the right side specifies the label/
name of an output form. Thus, an example of a transforma- 
tion rule would be: 

Token1, Token2 => TransformFcn(Token1, Token2) ::
TokenOut

Each of these rules thus serves to coordinate and trans-
form some input or intermediate data objects to other
intermediate objects or output objects. In this way the rules
are similar to a constraint-based description of the transfor-
mation process. The collection of rules comprises a data
flow network which maps the input structure or schema to
the output structure/schema. Execution of the rules in the
information bridge then carries out the actual data transfor-
mations. A rule binds to and acts on each instance of input
data which satisfies its conditions, or on each combination
of instances if the left hand side of the rule consists of
multiple components.

The left hand side identifies 'trigger conditions' for an 
invocation of the rule. Specifically, it may be token name(s) 
which occur in the input HLDSS, or else the result-label of 
some intermediate data object produced by another such
transformation/coordination rule. There may be multiple left 
side pre-condition labels for a transformation rule. The left 
side may include some additional constructs that serve as 
filter or guard predicates on execution of the rule. The 
transformation action represents a functional application,
and usually makes reference to a library of transformation 
operators, though it can include code fragments, much in the 
same spirit that Yacc and Lex admit code fragments within 
each production. The parameters to the function come from 
the data objects whose labels appear on the left side of the
rule. 

When there is a new data instance for each of these tokens 
on the left, the rule is triggered. The resulting value or 
subtree instance produced by the rule is then given the 
output label that is designated on the right side of the rule
specification. Specifically, the right side of a rule may be a 
simple data element or a complex data structure represented 
by a subtree. That is, the right side may be the label of any 
terminal node or the 'root' of any subtree/subgraph of the 
output HLDSS, or any intermediate data object that is to be
utilized by another rewrite rule. Note that each label of the 
input schema tree must be distinct so that this designation is 
unambiguous. This is facilitated by adding a suffix to 
components of the HLDSS which otherwise would have the 
same label.

Each rule is implicitly iterated for each distinct combi- 
nation of data object instances which satisfy the conditions 
on the left side. Thus multiple object labels occurring on the 
left side may give rise to a cross-product of data instances, 
each of which invokes the transformation. Alternatively, a
vector-dot product of the instances corresponding to each of 
the multiple labels on the left hand side may be specified. 
The conditions and operations (modifiers) which may be 
used on the left hand side of a transformation rule are
described below in reference to the virtual parent concept.
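
The difference between cross-product and dot-product triggering can be seen in a few lines. The sketch below assumes that the instances bound to each left-side label are available as lists and that the transformation is an ordinary function; the name fire_rule and its mode parameter are illustrative assumptions rather than part of the HLTRS.

```python
from itertools import product


def fire_rule(inputs, transform, result_label, mode="cross"):
    """Apply one transformation rule to the instances bound to its left side.

    inputs:       one list of instances per left-side label
    transform:    function taking one instance per label
    result_label: label given to every produced data object
    mode:         "cross" pairs every combination of instances;
                  "dot" pairs instances position by position
    """
    if mode == "cross":
        combinations = product(*inputs)
    elif mode == "dot":
        combinations = zip(*inputs)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [(result_label, transform(*combo)) for combo in combinations]


def add(a, b):
    return a + b


print(fire_rule([[1, 2], [10, 20]], add, "Sum", mode="cross"))
print(fire_rule([[1, 2], [10, 20]], add, "Sum", mode="dot"))
```

With two instances per label, the cross-product mode invokes the transformation four times and the dot-product mode invokes it twice.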

Some of the transformation operators provide rearrange-
ment of data objects (e.g., permutation), aggregation of
objects which satisfy a given criterion into a collection, and
the reverse process of separating a collection into individual
objects. Other operators iteratively apply a function to each
member of a subtree (a collection is a special case), as well 
as operators which are a generalization of relational 
selection, projection, and join. There are also set operations, 
including cross-product and dot product of sets. Such sets 
are actually bags since duplicates are allowed by default 
unless explicitly removed. 

Normally each invocation of the transformation produces 
one data object — which may be complex and consist of 
substructure. However, the Collect operator collects mul- 
tiple data objects and produces an ordered bag (a sequence) 
of data objects, where the relative collection point is either 
explicitly specified or else is the common parent (relative to 
the input grammar) of the multiple left hand side data 
objects. In contrast, the Each operator takes a single com- 
pound data object and produces multiple outputs, one for 
each top level member of the compound data object. 
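
The complementary behavior of Collect and Each can be illustrated as follows. The sketch represents a compound data object as a (label, members) pair, which is an assumption made for the example; the lower-case function names collect and each are likewise hypothetical stand-ins for the operators named above.

```python
def collect(instances, label):
    """Gather several data objects into one ordered bag under `label`.

    Order is preserved and duplicates are kept, so the result is a
    sequence rather than a set.
    """
    return (label, list(instances))


def each(compound):
    """Yield every top-level member of a compound object as its own output."""
    _label, members = compound
    for member in members:
        yield member


bag = collect([("node", 1), ("node", 2), ("node", 1)], "nodes")
print(bag)
print(list(each(bag)))
```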

One of the important aspects of the internal representation 
of the invention is that it supports not only tree manipulation 
and tree rewriting operations, but also the application of 
extended relational operators. Consider that for each subtree 
in the LSD schema tree where the subtree root is designated
as a collection (with * or + to designate multiple
occurrences), a collection of subtree instances is in the 
instance tree. A tree rewrite operation that is to be applied to 
this subtree can be iterated over each subtree instance. 

Alternatively, the first level children of a subtree may be 
treated as the (possibly complex) attributes in an n-ary tuple 
whose degree is the number of such children (as defined in 
the schema tree). Then the set of such instance subtrees can 
be seen to correspond to a set of tuples, with the label of the 
subtree corresponding to the name of a (virtual) relation. The 
operator is applied to each subtree instance, which corre- 
sponds to a tuple consisting of (possibly complex) first-level 
components. Extended relational operators thus can operate 
on each subtree instance by treating it in this way as a tuple. 
Note that the component of such a tuple may be complex, in 
that it may itself be a subtree, in which case a nested 
relational representation is being manipulated. 
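
The reading of subtree instances as tuples of a virtual relation can be sketched as below. Each subtree instance is assumed to be a (label, children) pair whose children are (child_label, value) pairs, and a value may itself be a list of children, which gives the nested case. The function name as_relation is a hypothetical helper.

```python
def as_relation(label, instances):
    """View the subtree instances carrying `label` as rows of a relation.

    Returns (column names, rows). The column names are taken from the
    first-level children of the first matching instance; each row holds
    the corresponding first-level values, which may be nested subtrees.
    """
    columns = None
    rows = []
    for instance_label, children in instances:
        if instance_label != label:
            continue
        if columns is None:
            columns = tuple(child_label for child_label, _ in children)
        rows.append(tuple(value for _, value in children))
    return columns, rows


instances = [
    ("node", [("nodeNumber", 1), ("xCoordinate", 0.0), ("yCoordinate", 2.5)]),
    ("node", [("nodeNumber", 2), ("xCoordinate", 1.0), ("yCoordinate", 3.5)]),
]
print(as_relation("node", instances))
```

The degree of the resulting relation equals the number of first-level children, and a child whose value is itself a list of children is carried through unchanged as a complex attribute.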

Thus both the extended relational operators and the tree 
manipulation operators are applicable. Since a transforma- 
tion is applied to each subtree instance, the alternate inter- 
pretations as either a set of tuples or a set of subtrees are 
dependent upon the application and the type of transforma- 
tion rule. 

Some of the transformation operators which are in trans- 
form library 44 will now be described. The partial list begins 
with general tree and sequence rewriting operations, and 
then describes extended relational operators. 

Note that a transformation operator is iteratively (or in 
parallel) applied to each subtree instance. The appropriate 
subtree is indicated by the label on the left side of the 
transformation rule. Each result produced by the operator is 
given the label from the right side of the transformation rule. 
Operators discussed below which take a label as an argu-
ment use such a label internally in their processing, e.g., to
relabel internal subtrees.

The tree manipulation operators and the extended relational
operators are described in turn. Following that, the
aggregation and disaggregation operators are described,
which are utilized as modifiers on the left hand side of
transformation rules, as needed. Iterate and rewrite
manipulation operators include:

Iterate(AggregateIn, Pred, ApplyFcn, NewLabel) 

Rewrite(AggregateIn, Pred, ApplyFcn) 
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These operators explicitly take advantage of the tree -like This function selects those tuples from the input relation 

structure of the internal representation — which is essentially for which the value of the i-th attribute matches with one of 

isomorphic to the LSD as described earlier. the attribute values specified in the selectionjceys. The 

Both operators walk the given input tree/sequence via output from this function is a relation or set consisting of the 

postorder traversal (root visited last), applying a function 5 tuples or subtrees which satisfy the above condition. This 

ApplyFcn to each node where the given predicate Pred is rule has some similarity to relational selection, but it focuses 

true. Iterate returns a collection of labeled list values from more on manipulation of the first level children of each 

the non-null function applications. Pred may be a user subtree instance. 

supplied predicate, or it may be an integer, a label name, or Extended relational join operators are of the form: 

a list of label names. When Pred is a positive integer i, this 10 Join (left_relation, right_relation, left attrib list, 

means that the condition is satisfied only at level i of the tree, right_attrib_list) 

where the root is level 0. If Pred is a label or a string of This transformation operator interprets the uniform inter- 
labels, the condition is true at a node of the tree if the node's nal LSD instance tree so that relational join operations are 
name matches the label or one of the labels in the list. When meaningful and easily carried out. The left_relation and 
Pred is false at a visited node the ApplyFcn is not invoked, 1S right_relation arguments specify the L3D subtrees each of 
but traversal continues into the interior nodes of its subtree. ™ ho f Ranees are treated as relational tuples consisting of 
For the Iterate operator, if NewLabel is non-null, then * e first ^vel children in that subtree instance. This opera- 
i ^ a i r. • • , * * i l i ii_ .i tion is applied to the collected set of instances of the left and 
eachvaluereturaedby ApplyFcn*^ ^ ^ ^ indicale which si _ 

result is a set of instances with this label-this set can be tional attributcs/components of me i eft ^ right 'tuples' are 

viewed as a one level tree or it can be viewed as a relation 20 to participalc m the equality-based natural join operation, 
with each such returned instance being a 'tuple'— with the comparison operation treats complex attributes (i.e. 

relation name appended as a label to each such tuple. mose ^ substructure) in terms of structure identity as the 

The rewrite operator is more general in that it allows basis of equahty matchmg. In effect this means that pointers 

rewriting and transformation of the input tree/sequence into mav be compared without recursively descending the sub- 

a more general output tree. Rewrite builds its output as a 25 structures. The implementation utilizes a hash join approach 

hierarchical tree-like structure by copying its Aggregated where a hash index is created for the tuples from tie left 

structure, subject to local rewrites at each node. Pred is relation based on its join key values, thereby reducing the 

treated the same as for the Iterate operator. potentially quadratic nature of a loop join into a process 

For the Rewrite operation, if ApplyFcn returns False the which is linear in the number of tuples, 
visited node is not copied; if it returns True, then the current 30 PredJoin(left_jelation, right__relation, left_attrib_Jist, 
node is copied as is. If ApplyFcn returns another value, then right_attrib _Jist, pred_Junction) 
it is taken as the value to which this node has been rewritten, This PredJoin operator is an extension of the above Join 
and thus replaces this visited node in the output of this operation. It extends the notion of join from equality match- 
Rewrite operator. In general, the result for the Rewrite ing as the join condition to testing of a user provided 
operator is a tree structure which is a projection of the 3S pred__function. This will fully accommodate less-than and 
structure of the original tree (i.e., some nodes may be greater- than joins as well as other join conditions. This 
omitted). The values for nodes of the rewritten tree may be operation is more costly because the hash join approach is 
transformed values or copies of the corresponding node of not applicable for an arbitrary predicate comparison of the 
the original tree. join keys. 

Extended relational operators include the following ^ JoinOuter(left_rel, right_rel, left_attribs, right_attribs, 
project/permute operators: outer_choice) 

Project(Permutation, SubTreeLabel) PredJoinOuter(left_rel, right_rel, lefL_attribs, right_ 

Pennute(Permutation, SubTreeLabel) attribs, pred_fcn, outer__choice); 

This operator permutes the top level members of each These two functions extend the prior join operations with 

subtree instance having the designated SubTreeLabel 45 the three outer join options, of LEFT, RIGHT, FULL, 

according to the specified Permutation order. The i-th entry meaning retain unmatched left tuples, or retain unmatched 

in the Permutation order specifies the relative index of the right side tuples, or retain all unmatched tuples from either 

initial entry which is to occur in the i-th position of the side. All the tuples actually are represented as LSD instance 

result. This may include repetition of members. trees, since they may include complex substructure for the 

The operator can be viewed in either of three related 50 attribute components, 
ways. It may be viewed as the traditional relational operator Grouping and ungrouping criteria for collecting input 

"project", in that each instance to which it is applied can be instances into subsets and for iterating over subsets are 

viewed as a tuple (possibly complex) of the relation named referred to as aggregation and disaggregation modifiers for 

by the second argument; it thus provides relational projec- the LHS (left hand side). The result of an aggregation is a 

tion onto the designated columns of the tuple. Alternatively, 55 partition of the instances, where each partition subset has a 

it may be viewed as SelectByPosition, since it selects the common value for the grouping criteria. This is analogous to 

components of the LabeledSubTree according to the indices the GroupBy clause of an SQL query, but is generalized to 

given in the first argument. Or it may be viewed as a apply to tree structured data. 

permutation, since it rearranges each first level sequence or A virtual parent, relative to the input tree, can be defined 

subtree of the second argument — it is iteratively applied to 60 as a means of providing grouping criteria on the left side of 

each such subtree instance at that level. These interpretations a transformation rule. This virtual parent may be the direct 

are equivalent operationally, but the interpretation is relative parent of the data collection, or it may be a prior ancestor of 

to how the user or integration administrator is viewing the the direct parent — in the latter case, the collection includes 

data. Thus there are synonyms for this operator as Project or a larger scope of data. This criteria serves to modify the 

Permute. 65 transformation rule to act on' the appropriate collection of 

SelectByMatch operator follows: data instances. Disaggregation operators can also be pro- 

SelectByMatch(selection_Jceys, i, relation) vided for the transformation rule. 
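The hash join approach described above for the Join operator, with
the LEFT, RIGHT and FULL outer join options, can be sketched as
follows. This is only an illustrative Python sketch under stated
assumptions, not the implementation of the information bridge: each
tuple is assumed to be a Python list of its first level children,
join attributes are given as positional indices, and complex
attributes (lists or dicts here) are compared by identity rather
than by descending into them. The names hash_join and _key are
hypothetical.

```python
def _key(tup, attribs):
    """Build a hashable join key from the positional attributes of a tuple."""
    key = []
    for i in attribs:
        value = tup[i]
        if isinstance(value, (list, dict)):
            # Complex attribute: compare by structure identity (pointer),
            # without recursively descending the substructure.
            key.append(("ref", id(value)))
        else:
            key.append(("val", value))
    return tuple(key)


def hash_join(left_rel, right_rel, left_attribs, right_attribs, outer_choice=None):
    """Equality join; outer_choice is None, 'LEFT', 'RIGHT' or 'FULL'."""
    # Hash index over the left relation, keyed on its join attributes.
    index = {}
    for pos, ltup in enumerate(left_rel):
        index.setdefault(_key(ltup, left_attribs), []).append(pos)

    result, matched_left = [], set()
    for rtup in right_rel:
        hits = index.get(_key(rtup, right_attribs), [])
        for pos in hits:
            matched_left.add(pos)
            result.append((left_rel[pos], rtup))
        if not hits and outer_choice in ("RIGHT", "FULL"):
            result.append((None, rtup))          # unmatched right tuple
    if outer_choice in ("LEFT", "FULL"):
        result.extend((ltup, None) for pos, ltup in enumerate(left_rel)
                      if pos not in matched_left)  # unmatched left tuples
    return result
```

Each relation is scanned once, so the work is linear in the number
of tuples plus the number of matches. A PredJoin with an arbitrary
predicate could not use the index and would have to fall back to
comparing every left tuple with every right tuple.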
The virtual parent is designated in the transformation rules below
by an UpTo(ancestor) clause on the left hand side of a rule. If the
UpTo clause is specified without an argument, then the default is
to utilize the grouping of data relative to a common parent of the
input. If all the LHS data has a common direct parent, it is used
as the virtual parent. If the LHS data references span more than
one parent label, then the default virtual parent is taken to be
the closest ancestor, from the input tree, which is common to all
member instances of the collection. An explicit virtual parent can
be specified to be a higher level ancestor.

These LHS operators are optional, and when used serve as modifiers
to the condition on the left hand side of the transformation rule.

A Collect operator allows for grouping a set of data instances
relative to a common criteria. The Collect operator appears in a
separate rule and must have an UpTo clause on the left side of the
rule; the form is:

    Token1 UpTo (virtual parent) => Collect (Token1) :: TokenOut

This rule will collect instances of Token1 into a collection which
will then be given the label designated on the right side of the
rule. Only one token type may appear on the left side of the rule.
All instances of Token1 under a single instance of the virtual
parent will be grouped into the same collection. If the virtual
parent in the UpTo clause is omitted, then the nearest ancestor to
Token1 which is an aggregate "plus" node, or the root, is taken to
be the virtual parent. The completion signal for this Collect rule
is the input instance tree node for the virtual parent.

An Each operator can be thought of as the inverse of the Collect
operator. The form is:

    CollectionToken => Each (Token1) :: TokensOut

This rule takes on the left side a label which represents one
collection and produces multiple outputs, one output for each
member of the collection. The input CollectionToken is consumed
only when the last member of the collection is processed. As for
the Collect operator, only one token type may appear on the left
side of the rule.

The Merge operator combines several different input tokens from the
left side into one sequence which is then passed to the body of the
transformation rule. The result is a single instance, which is
given the label on the right side of the rule. The form is:

    Merge (T1, T2, T3) => TransformFunction ($1) :: TokenOut

The TransformFunction may be any transformation taking one argument
which is a sequence or list of elements. The $1 represents the
first component from the left side, which here is the result of the
Merge operator.

The Expand operator is the reverse of the Merge operator. Expand
takes a single argument which is a sequence or list and treats the
first element of that sequence as the first argument to the
TransformFunction, and so on. Its form is:

    Expand (Token) => TransformFunction ($1, $2, $3) :: TokenOut

Note that the Expand operator produces one result token, the result
of applying the TransformFunction. This is different from the Each
operator which produces multiple outputs.

The cross product takes multiple inputs on the left hand side of
the rule and is distinguished by the * separating these inputs.
This rule has two forms, the first is:

    Token1 * Token2 * Token3 UpTo(VP) => TransformFunction ($1, $2, $3) :: TokenOut

Here VP represents a virtual parent of the input sets; this must be
a common ancestor of Token1, Token2, and Token3.

The cross product rule first performs an implicit Collect of all
instances of type Token1 to form a set which is called Token1+.
Similarly, an implicit Collect is performed for Token2 and Token3
arguments.

If the virtual parent in the UpTo clause is not specified
explicitly, then it defaults to the closest common ancestor of
these input types, relative to the input instance tree. Recall that
the common ancestor of derived data refers transitively to its
origins in the input instance tree.

The Cross Product then applies the TransformFunction separately to
each combination of instances from each of these collected sets
Token1+, Token2+, and Token3+. The CrossProduct collects these
results to produce one output sequence, which consists of the
sequence of results arising from every CrossProduct combination.
The CrossProduct is a predefined operator and can take an arbitrary
number of argument types.

The second form of the cross product rule simply omits the UpTo
clause, which changes the meaning of the rule just to the extent
that no implicit Collect is performed for the left hand side
tokens; the * serves to designate the rule as a cross product. In
this case, each of the left hand side token arguments is then taken
to already represent a set, as may occur if a left side argument
was a Node+ from the input tree or if the data had already been
processed by an explicit Collect rule. The cross product then
proceeds as before, applying the TransformFunction to each
combination of values from the input collections; duplicates are
not eliminated.

The inner product is similar to the second form of the cross
product in that it takes multiple left hand side tokens which each
represent collections, except that these arguments are now treated
as being ordered, and thus are sequences. The form of this rule is:

    Token1 . Token2 . Token3 => TransformFunction ($1) :: TokenOut

The dot or period designates the inner product instead of the * for
cross product. The first element from each of the input sequences
are paired to form the first tuple or vector of data to which the
TransformFunction is applied. Then the TransformFunction is applied
to the second tuple of data formed from the second component of
each sequence represented on the left hand side of the rule. The
output consists of a single sequence arising from each application
of the TransformFunction.

The dependency graph 200 shown in FIG. 4 is the basis for rule
execution. It is implemented as a high level abstract class for the
information bridge. It is composed of three submodules for the
input tree 210, rule graph 220, and output tree 230, respectively.
Although the term tree is used, the input and output actually may
be cyclic directed graphs, though they most often are trees.
Non-tree structures arise when two or more leaf nodes are identical
(corresponding to multiple parents) or when there is a directed
cycle representing a recursive structure, such as the contains or
part-of relationships that arise in applications such as bill of
materials. These three submodules are linked together to form the
actual dependency graph data structure and associated functions, as
shown in FIG. 4. The input tree feeds into the rule graph which
feeds into the output tree.

The input tree 210 actually consists of both a schema tree and an
instance tree. The input schema tree is constructed automatically
from the input HLDSS and includes accessor functions which will
retrieve and, if necessary, parse input data from multiple input
sources. A node 212 in the schema tree represents a named type of
data element, much as an attribute in a relational schema defines a
component of each tuple. There also may be specific aggregation
node(s) in a schema tree which represent a set, collection, or
aggregation of subcomponents. Thus a roster may represent the set
of students taking a course, or a database relation name may
represent the set of tuples which populate that relation table.
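As a rough illustration of the Collect, cross product and inner
product rules described earlier in this section, the following
Python sketch treats each collection as a plain list. It is a
sketch under assumptions rather than the rule engine itself: the
function names are hypothetical, the virtual parent is supplied as
a lookup function, and the TransformFunction receives the paired
values as separate arguments.

```python
from itertools import product


def collect(instances, virtual_parent_of):
    """Group instances by the virtual parent instance each one falls under."""
    groups = {}
    for inst in instances:
        groups.setdefault(virtual_parent_of(inst), []).append(inst)
    return groups


def cross_product_rule(transform, *collections):
    """Apply transform to every combination of one member per collection.

    Duplicates are not eliminated, matching the rule described above.
    """
    return [transform(*combo) for combo in product(*collections)]


def inner_product_rule(transform, *sequences):
    """Apply transform to the i-th elements of each ordered sequence."""
    # Assumption: sequences of unequal length are truncated to the shortest.
    return [transform(*row) for row in zip(*sequences)]
```

For two inputs [1, 2] and [10, 20] with addition as the transform,
the cross product yields four results (one per combination) while
the inner product yields two (one per position).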

When the accessor functions retrieve data instances, they
build an instance tree which has the same logical structure
as the schema tree, but now with potentially multiple
instance nodes for those schema nodes which are subordinate
to an aggregation node in the tree. There would be one
input instance aggregate node for each occurrence of a
collection (say, each course) and one data node for each
member of an aggregate (e.g. each student). Similarly, to
represent a relation in a relational database, there would be
a separate schema node for the relation itself, for a generic
tuple, and for each attribute type. In the instance tree there
would be as many tuple nodes as there are tuples, and each
tuple would have one instance node for the value of each
attribute of the relation.
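The relationship between the schema tree and the instance tree for
a relation can be sketched as below. This is a minimal Python
illustration, assuming rows are supplied as lists of values; the
class names SchemaNode and InstanceNode and the function
build_relation_instances are hypothetical and are not taken from
the described implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    label: str
    is_aggregate: bool = False              # True for a set/collection node
    children: list = field(default_factory=list)


@dataclass
class InstanceNode:
    schema: SchemaNode                      # the schema node this instantiates
    value: object = None
    children: list = field(default_factory=list)


def build_relation_instances(relation_name, attribute_names, rows):
    """Build the schema tree and the instance tree for one relation."""
    # Schema: one node for the relation, one generic tuple, one per attribute.
    attrs = [SchemaNode(name) for name in attribute_names]
    tuple_schema = SchemaNode("tuple", children=attrs)
    rel_schema = SchemaNode(relation_name, is_aggregate=True,
                            children=[tuple_schema])

    # Instances: one tuple node per row, one value node per attribute.
    rel_instance = InstanceNode(rel_schema)
    for row in rows:
        tup = InstanceNode(tuple_schema)
        tup.children = [InstanceNode(s, v) for s, v in zip(attrs, row)]
        rel_instance.children.append(tup)
    return rel_schema, rel_instance
```

The schema tree has a fixed size regardless of the data, while the
instance tree grows with the number of tuples under the aggregate
node.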

Input data instances are first inserted into the input tree. 
The data instances are then passed by the input tree to the 
rule graph, together with completion signals. The rule graph 
220 will apply specified transformation rules on the data 
instances to generate intermediate instances and output 
instances. The final rules of the rule graph insert the output 
instances into the output tree 230 which assembles and 
eventually outputs them accordingly. Thus dependency 
graph 200 forms the implementation structure which enables 
data driven processing of the input data instances. 

This processing is asynchronous and thus is amenable to 
parallel processing. Intermediate data tokens arising in the 
rule graph are released as soon as they are no longer needed. 
Elements of the input instance tree also could be released 
once all their dependencies have been processed. 

The asynchronous processing is supported by two major 
dependency coordination schemes in the dependency graph: 
(1) The blocking and unblocking scheme in the rule graph 
and in the output tree, and (2) a completion signal scheme. 
Currently input data instances can continue to be acquired 
even when the rules they feed are blocked — this could be 
changed easily to conserve intermediate memory if desired. 

The rule graph 220 is a directed graph and consists of two 
types of nodes — (intermediate) data nodes 222 and rule 
nodes 224. Each rule node is connected to one or more input 
data nodes and no more than one output data node. Several 
rules may interact in that the output of one or more rules may 
feed the input to one or more other rules, as shown in FIG. 
4. This is consistent with the fact that each HLTRS rule (rule 
node) can have multiple input labels/data and at most one 
output label (for one data node). Multiple other rules may 
utilize this rule's output, and this is accomplished by having 
a data node feed multiple HLTRS rules (rule nodes). 

The left-most data nodes in the rule graph are connected 
to input schema nodes of the input tree. The right-most rule 
nodes in the rule graph are connected to the output schema 
node in the output tree. The rule graph thus is a part of the 
data flow diagram in the dependency network. 

A blocking and unblocking scheme is used to ensure a 
smooth and proper flow (i.e., no overflow) of data instances 
in the rule graph. Each rule has one queue for each of the
rule's inputs. When the rule is fired and it creates output, the
rule is blocked until the output is consumed by another rule
or by the output tree. In the case of a collect rule, discussed
below, the rule does not produce its output until the
collection is completed. The implementation considers a rule
blocked when its output node is blocked, which indicates
that the result produced by the previous execution of the rule
has not yet been consumed.

Once the rule is unblocked, the rule may be triggered
again when there is at least one data element in each of its 
input queues. A rule is considered activated once it has been 
triggered and before it has completed execution. Activated 
rules are appended to a working list of activated rules, and 
can be executed in the order in which they were activated. 
The activated rules could also be executed in parallel on a 
multiple processor platform. 
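The triggering and blocking behavior described above can be
sketched as follows. This Python sketch makes simplifying
assumptions that are not from the source: each rule holds a single
output slot, a caller-supplied sink consumes every output
immediately, and each rule has at least one input. The names Rule
and run are hypothetical.

```python
from collections import deque


class Rule:
    def __init__(self, name, n_inputs, fn):
        self.name, self.fn = name, fn
        self.inputs = [deque() for _ in range(n_inputs)]  # one queue per input
        self.output = None      # result not yet consumed
        self.blocked = False

    def ready(self):
        # Triggerable only when unblocked and every input queue is non-empty.
        return not self.blocked and all(self.inputs)

    def fire(self):
        args = [q.popleft() for q in self.inputs]  # one instance per queue
        self.output = self.fn(*args)
        self.blocked = True     # stays blocked until the output is consumed

    def consume(self):
        value, self.output, self.blocked = self.output, None, False
        return value


def run(rules, sink):
    """Execute activated rules in the order in which they were activated."""
    working = deque(r for r in rules if r.ready())
    while working:
        rule = working.popleft()
        rule.fire()
        sink.append((rule.name, rule.consume()))
        if rule.ready():        # re-activate if more input is waiting
            working.append(rule)
```

A rule with two inputs holding [1, 2, 3] and [10, 20] fires twice
and then stops, leaving the unmatched 3 in its first queue.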

When a rule is executed it utilizes one data instance from 
each of its input queues. The implementation utilizes one 
physical queue for each rule's output. This queue may feed 
one or more other rules — in the latter case, all the consumer 
rules have logical queue pointers into this one physical 
queue. The rule creating this output is considered unblocked 
when any of the logical queues of subsequent rules is empty. 
The actual data instance is removed from the physical queue 
only when all logical queues have utilized this data instance. 
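One way to picture the single physical queue with several logical
queue pointers is the sketch below. It is written in Python for
brevity and is an assumption-laden illustration, not the described
implementation: consumers are identified by name, each logical
queue is an absolute read position, and the class name SharedQueue
is hypothetical.

```python
class SharedQueue:
    """One physical queue read by several consumer rules via logical cursors."""

    def __init__(self, consumers):
        self.items = []                            # physical storage
        self.base = 0                              # absolute index of items[0]
        self.cursor = {c: 0 for c in consumers}    # read position per consumer

    def put(self, item):
        self.items.append(item)

    def logical_empty(self, consumer):
        return self.cursor[consumer] >= self.base + len(self.items)

    def producer_unblocked(self):
        # The producing rule may run again once any logical queue is empty.
        return any(self.logical_empty(c) for c in self.cursor)

    def get(self, consumer):
        item = self.items[self.cursor[consumer] - self.base]
        self.cursor[consumer] += 1
        # Physically remove instances that every consumer has already used.
        slowest = min(self.cursor.values())
        del self.items[:slowest - self.base]
        self.base = slowest
        return item
```

An instance read by one consumer stays in the physical queue until
the slowest consumer has also read it, which is the behavior the
paragraph above describes.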

The output tree 230 actually consists of both an output 
schema tree and an output instance tree — this is similar to 
the input tree. The output instance tree may have potentially 
multiple output instance nodes 232 corresponding to a given 
schema node. 

Rule nodes may feed output schema nodes and/or other 
rule nodes. As instances pass from these rule nodes to the 
output schema nodes, each schema node is responsible for 
creating new output instance nodes and inserting them 
properly in the output instance tree. If the data instances are 
atomic then they are inserted into leaf nodes in the output 
instance tree — the relevant parent and ancestor nodes are 
created automatically. Alternatively, rules may process and 
produce sets as well as data instances with substructure, in 
which case such non-atomic data may be inserted as a 
subtree in the output instance tree. 
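The automatic creation of parent and ancestor nodes on insertion
can be illustrated with a small sketch. It assumes, purely for
illustration, that the output instance tree is a nested Python
dict keyed by label and that each label occurs once per parent;
multiple instances per schema node and the blocking scheme are left
out. The function name insert_output is hypothetical.

```python
def insert_output(tree, path, value):
    """Insert value at path (a list of labels), creating missing ancestors.

    An atomic value becomes a leaf; a dict value is attached as a subtree.
    """
    node = tree
    for label in path[:-1]:
        node = node.setdefault(label, {})   # create the ancestor if absent
    node[path[-1]] = value
    return tree
```

Inserting an atomic title under ["course", "title"] creates the
"course" node on demand, and a later insertion of a structured
instructor value attaches the whole substructure under the same
parent.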

Transformations may be expressed as a set of transfor- 
mation rules — though in the limiting case, one rule could 
produce the full output tree. A transformation rule or a chain 
(composition) of transformation rules establishes a corre- 
spondence or mapping from one or more leaves and subtrees 
of the input to one or more leaves or subtrees of the output 
instance tree. 

There also is a blocking and unblocking scheme on the
output tree, and this is coordinated with the rule graph.
Blocking/unblocking on the output schema nodes will cause
blocking/unblocking on the rule nodes in the rule graph.

In the output tree the blocking and unblocking scheme
serves two related purposes:

(1) An output instance subtree with distinct children nodes
must be completed before output data is inserted for the
next instance of that subtree. This is ensured by blocking
introduction of new instances of a different subtree
(i.e. having a different subtree root) until all components
of the current subtree are complete. When all member
instances are present the current subtree is marked
complete. This unblocks the subtree root for creation of
another subtree instance, and the children of that new
subtree then may receive output data. As an example, a new
relational tuple cannot be started until the previous
tuple's data is completed.

(2) A data collection/set/aggregate must be complete before
processing data for the next occurrence of this data
aggregate. This is somewhat similar to the first case,
but here it is referred to as a set of data in which there
may be one schema node type representing the multiple
instances in the set. The criteria is enforced by building
a dependency relation between the aggregate schema
node and the schema node representing members of the
aggregate. Creation of a new aggregate is blocked until
all instances of that set are complete, as determined by the
completion signal scheme.

The subsequent material describes the generalized GroupBy criteria,
cross product rules, and some of the other generalizations of
nested relational operators that have been developed for data
integration. It also goes into greater detail regarding the
signaling conditions which help to implement the above subtree and
set completion criteria.

The completion signal scheme augments the blocking and unblocking
of rules and is used to determine when a set or collection of data
is complete. Two different cases arise. A parent node may have
several distinct children nodes, each having its own different
label. That parent node and its set of children are considered
complete when there is an instance for each child node. Subsequent
instances would be associated with a different parent.

The second case arises when an input or output tree node has an
explicit set indicator, which is referred to as a "plus" node in
the HLDSS because the syntax is Node+. This represents the parent
node of the set or collection, and the type of the children which
are the members of the set are indicated by a schema node labeled
Node (without the 'plus'). The Collect and Product operators in
transformation rules depend on knowing when their input sets are
complete. The output tree also depends on knowing when any output
nodes representing a set or collection (i.e. designated as "plus"
nodes) are complete.

To determine such completion a logical association is made with
each data instance in the input tree of a list of the instance
nodes of all its ancestors. For each set or collection operator in
the rule graph, and each set or collection node in the output tree,
one of these ancestors is selected as a virtual parent. The virtual
parent provides the basis for the collection. In general, the
virtual parent is the closest ancestor common to all the member
instances.

When it is determined that a collection is complete, designated
member instance nodes relative to one instance of the virtual
parent are collected. The next instance of the virtual parent, if
any, provides the basis for the next instance of the collection. In
effect the virtual parent serves as a delimiter or boundary between
different collections of instances. For data derived in the rule
graph, the virtual parent refers transitively to the common virtual
parent of all the data instances from the input instance tree that
gave rise to the derivation of this data.

The virtual parent is designated in the transformation rules below
by an UpTo(ancestor) clause on the left hand side of a rule, as
noted earlier. The virtual parent serves as a form of generalized
GroupBy criteria. If the UpTo clause is specified without an
argument, then the default is to retain the grouping of data from
the input. That is, the default virtual parent is taken to be the
closest ancestor from the input tree which is common to all member
instances of the collection or set.

This analysis is done during code generation for the information
bridge, where the dependency graph is preanalyzed to determine
which ancestors are needed. Furthermore, in the implementation each
data instance with its relevant ancestors is not directly
annotated, rather it is created as a separate virtual or invisible
data node for each instance of these virtual parent ancestors. Each
such virtual data node is referred to as a completion signal and it
is propagated through the rule graph to the output tree. Thus when
a collection of input data is complete, a signal representing the
parent of that collection is inserted in the dependency network
after its set of data.

During the analysis phase, code is inserted in the relevant rules
and output schema nodes to detect the signals which are needed.
Rule graph nodes also filter out multiple occurrences of the same
signal, such as may arise when a rule has multiple inputs. Only the
last such duplicate signal is propagated onward. The number of
duplicate signals is determined during analysis time and this count
is used to determine the duplicate.

A more detailed discussion regarding the parsing and recognition of
heterogeneous data resources will now be addressed. Such parsing
and recognition of source data are accomplished by modules which in
[...] cases [...] of generated code that is created by the earlier
modules of the system. Specifically, the HLDSS preprocessor, which
parses and interprets the HLDSS specification, is responsible for
generating such code and tailoring it to the annotations given in
the HLDSS.

This code generation process creates the LSD schema tree and the
modules (e.g., parsers) which access data and build the LSD
instance tree. This generated code also performs the necessary data
conversions from the external form to the internal values, as
dictated by the Type specifications which are also part of the full
HLDSS (not shown in FIG. 3).

FIG. 3 shows a logical structure diagram which accesses different
data resources. The multiple overlapping ovals surrounding nodes
marked with a "+" signify that multiple data instances correspond
to this node for a single parent instance.

The nodes of the LSD are annotated with the annotations that are
given in the HLDSS, together with derived annotations that are
determined during processing. One important category of derived
annotations is that of uniform regions, discussed below, which
determines the access and parsing/generation of data.

The schema tree represents the logical structure of the data. The
structure of the schema tree, specifically the parent/child/sibling
relations, are represented by the productions in the HLDSS file.
The right-hand side elements of a production are children of the
left-hand side. And that left-hand side, in turn, is a child of the
production in which it appears as a right-hand side. All elements
that appear together on the right-hand side of a production are
siblings.

The type specifications, specifically integer, float and string,
describe the data types of the actual instance data. The
annotations and spec lines describe details of how the data is
stored. Note that this is independent of the structure of the
schema tree. A given schema tree can map back to data that is
stored in very different ways, like a flat file, a web page, a
relational database or an object oriented database, but share the
same logical structure, by the use of different annotations for the
input and output.

How the schema tree is actually built will now be described. The
schema tree consists of "schema nodes," where each node corresponds
to a right-hand side element in an HLDSS production. An example of
part of an HLDSS specification is given earlier when discussing
FIG. 3.

Each node has attributes such as a label and a node type. The
node's label is determined by the corresponding element name in the
HLDSS file. For terminal nodes, their type corresponds to the type
of their data, i.e., string, float, integer, etc. For non-terminal
nodes, their type describes what kind of non-terminal it is.
Non-terminal nodes can either be left-hand side nodes, or "plus"
nodes.

Multiple instances of a data type, such as multiple rows of a
database table, in which each row has the same format, are
specified by using the extended BNF plus ("+") operator to indicate
"one or more occurrences" of that node. Manifest 112, element 140,
and node 150 which appear in FIG. 3 are three examples of plus
nodes. The number of instances of a "+" node can be optionally
specified, either as a fixed constant or arithmetic expression, or
it can be specified by a variable in the HLDSS file, such as
element and node in the previous example. If no bound is specified,
then the node type is recursive.

In generating the schema tree for the information bridge, 
each non-terminal that appears on the left-hand side of a 
production in the HLDSS file gives rise to a corresponding 
non-terminal node in the schema tree. The children of that 
rule-node are all of the nodes that appear on the right-hand 
side of the aforementioned production. Any of these right- 
hand side nodes that are non-terminals appear as the left- 
hand side of another production. This process proceeds 
recursively. 
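The recursive construction of the schema tree from productions can
be sketched as follows. The sketch is in Python for brevity and
rests on assumptions that are not from the source: productions are
given as a dict from a left-hand side name to its list of
right-hand side elements, a trailing "+" marks a plus node, the
productions are not recursive, and the function name build_schema
and the example labels are hypothetical rather than actual HLDSS
syntax.

```python
def build_schema(productions, root):
    """Build a nested schema tree from productions {lhs: [rhs elements]}."""

    def expand(label):
        if label.endswith("+"):
            # A "plus" node is the parent of the set; its single child is the
            # schema node for the members, labeled without the plus.
            return {"label": label, "kind": "plus",
                    "children": [expand(label[:-1])]}
        if label in productions:
            # A left-hand side node: its children are the right-hand side.
            return {"label": label, "kind": "lhs",
                    "children": [expand(rhs) for rhs in productions[label]]}
        # No production of its own: a terminal node holding actual data.
        return {"label": label, "kind": "terminal", "children": []}

    return expand(root)
```

For productions such as mesh -> header element+ and
element -> id node+, the call build_schema(productions, "mesh")
yields a root with two children, the second of which is a plus node
wrapping the element subtree.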



When there is shared or recursive substructure, the LSD 
schema becomes a more general graph rather than a tree due 
to multiple parent nodes and/or cyclic references. 

A subtree is complete if there is a unique node which 
serves as a root, and such that the subtree includes all and 
only those nodes reachable from the root via the directed 
edges. A subgraph is complete if it contains a complete 
subtree which spans all the nodes of the subgraph. 

Uniform regions partition the set of terminal nodes, and 
thus the set of actual data, into homogeneous subsets, each 
of which may be accessed by a single access mechanism and 
parser/recognizer. Shared substructure is represented by an 
LSD instance subgraph which has two or more parent nodes. 
If the LSD schema graph is cyclic, this represents a recur- 



3~ co mpilation process basically is the distinction betw een I 
analvsisfcode generation) and execution. I f an interprete d 
language su ch as Lisp or Smalltalk is used, then the op era- 
tion~distihction between these phases would not be as 



notice a ble, as an mtern rete^ language can both analyze and 
nd then proceed to execute that code alLuT 



o The motivation behind this two step code generation ancHLS sively defined data structure. The corresponding LSD 

instance graph need not have such cycles since each com- 
ponent could be represented by a separate copy of its 
instance subgraph — unless the instance data is also shared. 
Thus multiply referenced substructure in the LSD schema 
10 graph does not automatically imply shared substructure in 
the LSD instance graph — it is just the potential for sharing 
which is indicated by an explicit annotation in the HLDSS 
and in the LSD schema graph. 

Thus a uniform region is a complete subtree or subgraph 
corresponding to data from a single data source — such as a 
relational DB or object DB. A structured file may consist of 
different uniform regions if, for example, portions of the file 
are delimited ASCII strings, and other parts are binary data. 

Note that a subtree is not uniform if any of its terminal 
nodes is from a different data resource than any other 
terminal node of that subtree/graph. Thus two or more 
adjacent subtrees of a common parent may be from the same 
data resource, but need not be part of a single common 
region if another descendent of the common parent of these 
subtrees is from a different source. In the latter case, some 
optimizations are possible when two or more uniform 



ge nerate code and 
the same proces s. 

For the two step code system, the first phase analyzes the 
HLDSS and creates the description of the LSD schema tree 
and the necessary data structures. The second phase utilizes 
the description that was generated in the first phase to 
actually construct the data structures in the executing infor- 
mation bridge process. This represents two distinct process- 
ing phases, specifically the "bridge generation" and "bridge 
execution" times, respectively. Since C (and C++) are com- 
piled languages, they necessitate a separation of these 
phases. Interpreted languages, such as Lisp, TCL/TKor Perl 
provide an "eval* operator that allows (source) code gener- 
ated by a program to be interpreted and executed by the . 
program that created it. pS 
There are two viable alternative solutions when using a 



language that does not provide such a facility. The first is to ^regions of the same type are adjacent in the LSD. 



create an internal language and interpreter for the interme- a 
diate representation and use that for the description and r 
interpretation. The drawback of this approach is that the 40 
internal representation is limited by the language that is 
created to describe it. As system requirements change or 
expand, this representation, as well as the interpreter, must 
be changed to allow new features to be added. The second 
option is to generate code in an existing language, compile 
the code, and then execute it. The drawback is that it
requires separate compilation and execution phases, which
represent separate processes. However, the benefit is that
since the generated code is in the same language as the
generator (C++ in this case), it can be directly used and has
all the expressive power of the underlying language (C++).
The HLDSS parser generates C++ code, which is then
compiled in the first phase. The compiled code, when
executed, creates the actual data structures for the schema
and instance trees when the information bridge is executed
in the second phase.
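
The two phases can be illustrated with a minimal sketch. It is written in Python, an interpreted language chosen here only for brevity, so both phases run in one process; the system described above generates and compiles C++ instead. The function names and the toy production format are assumptions made for illustration and are not the actual HLDSS syntax.

```python
# Phase 1 ("bridge generation"): emit source code that describes the schema tree.
def generate_schema_source(productions):
    """Return Python source that, when executed, rebuilds `productions`.

    `productions` maps a left-hand-side name to its list of components.
    This toy format is an assumption, not the real HLDSS.
    """
    lines = ["schema = {}"]
    for lhs, rhs in productions.items():
        lines.append(f"schema[{lhs!r}] = {list(rhs)!r}")
    return "\n".join(lines)


# Phase 2 ("bridge execution"): run the generated code to build the structures.
def execute_generated(source):
    namespace = {}
    exec(source, namespace)  # plays the role of an "eval" operator
    return namespace["schema"]


productions = {"manifest": ["header", "nodes"], "header": []}
source = generate_schema_source(productions)
schema = execute_generated(source)
```

In a compiled setting the string held in `source` would instead be written to a file, compiled, and run as a separate process.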

A uniform region is essentially a contiguous set of data 
from a common data source such that it is uniformly 
parsable, e.g., tuples from database relations, or binary data 
from a file. 

More formally, a uniform region is defined as a complete
subtree or subgraph within the logical structure diagram, all 
of whose terminal nodes are uniformly parsable and from a 
single data source, and such that there is no larger containing 
complete subtree/subgraph which qualifies as a uniform 
region. Terminal nodes correspond to actual data, and non- 
terminal nodes correspond to structures and aggregates. 
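
The definition above can be sketched as a bottom-up pass over the tree. This is only an illustration under simplifying assumptions: the `Node` class and `find_uniform_regions` function are hypothetical names, every terminal node is assumed to carry a `source` label, and "uniformly parsable" is folded into that single label.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    source: Optional[str] = None  # set on terminal nodes only


def find_uniform_regions(root):
    """Return the names of the roots of the maximal uniform subtrees."""
    regions = []

    def visit(node):
        # Returns the single data source of the subtree, or None if mixed.
        if not node.children:
            return node.source
        child_sources = [visit(child) for child in node.children]
        first = child_sources[0]
        if first is not None and all(s == first for s in child_sources):
            return first  # still uniform; an ancestor may absorb this subtree
        # Mixed sources: every uniform child subtree is maximal, so record it.
        for child, src in zip(node.children, child_sources):
            if src is not None:
                regions.append(child.name)
        return None

    if visit(root) is not None:
        regions.append(root.name)  # the whole tree is one uniform region
    return regions


tree = Node("root", [
    Node("A", [Node("a1", source="db"), Node("a2", source="db")]),
    Node("B", [Node("b1", source="file")]),
    Node("C", [Node("c1", source="db")]),
])
regions = find_uniform_regions(tree)
```

Here subtrees A and C come from the same data resource, yet they remain separate regions because their sibling B comes from a different one.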

An individual or atomic field, relative to its data source,
may have further decomposition specified for it by the 
HLDSS. As an example, in a relational database, if a field is 
a binary large object, then it would appear as a single atomic 
field in the relational tuple, but it may have further substruc- 
ture and thus would be decomposable by additional parsing. 
This would be indicated by an explicit annotation on the 
production that defines the substructure of that field, indi- 
cating the substructure to be applied — this may be text 
parsing of a comment, or binary data parsing of graphic or 
video data, for example. This would not change the nature 
of the containing uniform region (i.e., as a relational data- 
base region) since the data comes from the same data source 
and the substructure is of a single data component or field 
value. 

In the system, the term 'parser' is used more generally 
than when parsing text or languages, since it can mean data 
access involving structure decomposition (one might say 
structural parsing) and the conversion of raw external data 
(e.g. ASCII and binary data) into internal data values such as 
integers, reals, text, etc. A distinct data access function or 
'parser' is associated with each uniform region. If data is 
from two different files, it is easy to see that two different 
parsers may be appropriate, especially if the files have 
different data formats. 

Parsers can be organized as object types according to the 
taxonomy of data resource types. An instance of a parser 
object type is associated with a specific uniform region in the 
LSD schema. That parser instance is activated whenever 
data from that uniform region is to be accessed or created. 
Different uniform regions will have different instances of the
same parser object type when these regions have the same 
type of data source and same type of data representation — 
e.g. objects in an object database. Utilizing different parser 
object instances, even of the same type, captures the possible 
differences in data sources (e.g., two different ASCII files) 
and the different states those parser object instances may be 
in when (an occurrence of) their region is complete. 

Thus a given uniform region is characterized by a data 
source, a parser type, and a parser object instance; and this 
information is attached to the unique node in the LSD 
schema tree/graph which is the root of this uniform region. 

The process of determining uniform regions and associated
parsers for the LSD tree (or graph) serves to cover all
terminal nodes with distinct parsers. The root or distin- 
guished node of each uniform region now has a parser 
method associated with it which can recognize (or create) 
data in that region. 

All nodes which are outside some uniform region are 
designated as parser controller (PC) nodes. These nodes are 
parents of either uniform regions and/or other parser con- 
troller nodes. Whereas the behavior of each parser method is 
specific to the data source and data format of its uniform 
region, the parser controller process is general and may be 
invoked as a reentrant procedure for each PC node in the 
LSD. 

So far the schema of the LSD has been described.
Corresponding to each node of this LSD schema, there will 
be several instance nodes, all being instances of this type of 
LSD schema node. For the root nodes of uniform regions, 
each associated instance node represents a different instance 
or occurrence of this uniform region in the actual data. 

The instance nodes taken together form the LSD instance 
subtree/subgraph. Another interesting way of thinking about 
the LSD schema tree is that it provides a kind of pattern 
which is matched against the structured data. As such 
matching proceeds, instance structures are assembled in 
accordance with the structure of this pattern — when taken 
together, these instances form the LSD instance subgraph. 

The operation of the parser and parser controller nodes 
during data access will now be addressed. The process is 
described sequentially, although the process is designed to 
support parallel execution, subject to data dependencies. 

The data access and parsing process begins with the top 
or maximal node of the Logical Structure Diagram — if the 
LSD is a tree, this is the natural root; if the LSD is a graph, 
then this is a designated node such that all other nodes of the 
LSD are reachable as descendants of this node. 

Each child node then is visited in turn, proceeding from 
left to right for sequential processing. Parallel execution 
would be supported by processing independent subtrees in 
parallel. Limited dependencies between subtrees could be 
approached by message passing between the respective 
parallel processes for these regions. 

Each child node that is visited is either a parser controller 
(PC) node or is the root of a uniform region. Each PC node 
in turn visits its immediate children in left to right order, and 
each returns a data structure which represents a correspond- 
ing data instance tree. The parser controller assembles these 
instance subtrees from its children into an instance tree for 
this PC node. When all its children have been processed, this 
tree is returned to the caller of the PC node. 

This process is the same for all parser controller nodes, 
except that the top node of the LSD returns its instance tree 
for subsequent processing. If the LSD schema has cycles, 
each cycle is treated as recursive invocation of the subgraph 
pattern as the basis for parsing, until the data is exhausted. 
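
The sequential traversal just described can be sketched as a small recursive procedure. The dictionary-based node layout, the `parsers` mapping from region root names to callables, and the function name `parse_node` are assumptions for illustration; the sketch handles trees only and does not attempt the cyclic case.

```python
def parse_node(node, parsers):
    """Return an instance tree for `node` as a (name, payload) pair."""
    if node["name"] in parsers:
        # Root of a uniform region: delegate to that region's own parser.
        return (node["name"], parsers[node["name"]]())
    # Parser controller node: visit children left to right and assemble
    # their instance subtrees into the instance tree for this node.
    return (node["name"], [parse_node(child, parsers) for child in node["children"]])


lsd = {"name": "design", "children": [
    {"name": "parts_db", "children": []},
    {"name": "files", "children": [
        {"name": "header_ascii", "children": []},
        {"name": "body_binary", "children": []},
    ]},
]}

# Stand-ins for region parsers; real ones would read a database or a file.
parsers = {
    "parts_db": lambda: [("id", 1)],
    "header_ascii": lambda: ["v1"],
    "body_binary": lambda: [b"\x00\x01"],
}

instance_tree = parse_node(lsd, parsers)
```

The same procedure serves every parser controller node, which is what allows it to be invoked reentrantly.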

In principle, a uniform region could be as small as a single
terminal node, or as large as the whole LSD. 

Thus the execution of the parser controller nodes provides
the postorder traversal of the tree down to the level of
uniform regions. Then the associated parser for the particular
type of uniform region is invoked. It traverses the nodes
of the uniform region, and constructs a corresponding
instance subtree/subgraph. The uniform region parsers
typically are specialized, and interpret the LSD schema tree
relative to the specific annotations of that region. The nodes
and edges may be treated differently in different regions, and
even different edges and nodes within a region may have
different meanings based upon the annotations. In contrast,
each parser controller node treats its immediate descendants
uniformly, and this applies recursively downward until a
uniform region or terminal node is encountered.

The operation of specific types of uniform region data
access functions, referred to as parsers, will now be
addressed. All relational database parsers share substantial
functionality, and this is accomplished by the definition of a
relational parser object type. Then specific relational
databases are handled by creating associated parser object
subtypes, such as for Oracle, Sybase, and Informix database
products.

The methods associated with the relational parser object
type include: 1) connecting to the database server, 2) opening
the database, 3) issuing the query and opening the
associated virtual or actual relation (the result of a query
which joins relations is a virtual relation that is materialized
by the database system), 4) iteratively accessing (retrieving
or inserting/updating) each individual tuple and advancing
the database cursor, 5) committing the transaction (for
insert/update) to end the access, and 6) closing the database
and disconnecting from the server. Steps 3 and 4 may be
repeated for each of several queries before the subsequent
completion steps are initiated. For some system interfaces,
connecting and opening may be a single combined
operation; in this case, the specialized object subtype for
this interface would have a null method for step 1, while the
method for step 2 would perform the combined operation.

The relational parser object type initiates a method to
open the database (step 2) when the uniform region is about
to be processed. If the connection has not been made to the
database server by a previous uniform region, then a method
for step 1 is executed. Following issuance of the query,
individual tuples are retrieved; this is actually step 4a, as
step 4 consists of two parts. Step 4b iterates over each
attribute field of a retrieved tuple to create another terminal
data instance node for the instance tree of this uniform
region. When the fields of a tuple have been processed, then
step 4a is repeated to obtain the next tuple, until all tuples
have been processed.

After all tuples for a query have been processed, or that
query is otherwise exited, step 5 commits the transaction,
thereby allowing the database to release access to the
underlying tables that serviced this query; typically this is
needed only for multi-step insert and update access but not
for simple query retrieval. Step 6 to close the database is
executed only when the system has determined that no
further access to this database is needed, or at the end of
processing the LSD and all data for it.

These generic relational database operations are further
specialized for Oracle, Sybase, and Informix parsers, and for
other relational systems, by specializing each of the above
methods for the specific syntax and operations required by
the DBMS.
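
The six steps and their specialization by subtype can be sketched as follows. This is a hedged illustration, not the system's implementation: the class and method names are hypothetical, and SQLite (through Python's standard `sqlite3` module) stands in for the Oracle, Sybase, and Informix products named above. SQLite happens to combine connecting and opening, so the subtype gives step 1 a null method.

```python
import sqlite3


class RelationalParser:
    """Generic step sequence; subtypes supply the DBMS-specific operations."""

    def connect(self):             # step 1
        raise NotImplementedError

    def open_database(self):       # step 2
        raise NotImplementedError

    def issue_query(self, sql):    # step 3
        raise NotImplementedError

    def commit(self):              # step 5
        raise NotImplementedError

    def close(self):               # step 6
        raise NotImplementedError

    def read_region(self, sql):
        """Run steps 3 to 5 for one query; return one field list per tuple."""
        cursor = self.issue_query(sql)
        instances = []
        for row in cursor:               # step 4a: fetch the next tuple
            instances.append(list(row))  # step 4b: one terminal value per field
        self.commit()
        return instances


class SqliteParser(RelationalParser):
    def __init__(self, path):
        self.path = path
        self.conn = None

    def connect(self):
        pass  # null method: no separate server connection exists

    def open_database(self):
        self.conn = sqlite3.connect(self.path)  # combined connect and open

    def issue_query(self, sql):
        return self.conn.execute(sql)

    def commit(self):
        self.conn.commit()

    def close(self):
        self.conn.close()


parser = SqliteParser(":memory:")
parser.connect()
parser.open_database()
parser.conn.execute("CREATE TABLE part (id INTEGER, name TEXT)")
parser.conn.executemany("INSERT INTO part VALUES (?, ?)", [(1, "bolt"), (2, "nut")])
rows = parser.read_region("SELECT id, name FROM part ORDER BY id")
parser.close()
```

Because `read_region` only calls the step methods, a subtype for another DBMS would override those methods and inherit the iteration logic unchanged.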

Both relational and object database parser types are
themselves subtypes of the 'database' object type in the
system. This serves to abstract from the above methods
those operations which are in common among all databases,
such as generic initiation of a connection and opening the
database, establishing a query or access pattern, accessing
data instances, completing the access, and eventually
disconnecting from the database.

For object databases, query-based access can be used to
establish a focus within the OODB. Then traversal of the
object instances leads to the particular data instances of
interest. Each OODB system has its own query language,
and some have no query language so that only navigational
access is possible. In general, OODB query languages are
moving toward some form of object-SQL.

The navigational access operations are specified either as
a program consisting of method invocations, or preferably as
a sequence of path descriptors to traverse the OODB
instances. The latter can be described declaratively in terms
of OODB schema edges and nodes, such as by a path
expression. Some nodes along these paths are retrieved,
updated, or inserted, while other nodes are intermediate
points for the access.

After the query is executed and the objects have been
accessed, the connection to the database may be closed when
the system determines that no other access to that database
will be needed or upon completion of all interactions.

Structured files, databases, and other data resources are
treated uniformly: the same HLDSS specification language
and LSD internal representation applies to all heterogeneous
representations. The differences are dictated by the
annotations.

Structured files are common as the import and export
media of commercial design tools. There are two types of
uniform regions for structured files: ASCII and binary
regions. ASCII uniform regions include string and numeric
data in 7-bit ASCII representations. A binary region consists
of values which are 8-bit data (which includes control
characters). Both ASCII and binary uniform regions are
processed sequentially from left to right by the respective
type of parser object instance.

For an ASCII region, data may be delimited (by
whitespace or specified delimiters) or may consist of a
specified number of characters, where this length may be
determined by the value of other data which precedes the
variable length string. Numeric ASCII data may be
represented in standard base ten notation or, for integers, in
hexadecimal or octal notation (hex begins with "0x" and
octal has a leading zero). The number of elements in a
sequence or repeating group of elements may be determined
from the value of prior data, by encountering a different data
type, or else by finding that the next data characters match
a specified pattern (regular expression); the alternative is
specified by the HLDSS.

A binary region is processed by a different type of parser
which recognizes 8-bit data and treats field lengths in bits
rather than bytes. The length may be fixed or may be
specified by prior data or terminated by data matching a
given pattern (regular expression).

Another type of data resource is the World Wide Web.
Each individual data source is specified by a Uniform
Resource Locator (URL). A URL can be thought of as a
universal file reference across the entire internet. When an
annotation specifies a URL, the data resource is accessed
and then the retrieved data is decomposed or parsed in the
same manner as for an ASCII file. Embedded hyperlinks
may also be processed by the system.

Application Programming Interfaces (APIs) consist of
one or more procedures or methods that serve as the
interface to a subsystem or a software package. Relational
database access, discussed above, is an example of an API,
though it is treated separately because of its special nature.
Other API interfaces include HDF and netCDF data files,
which are best accessed through the software interfaces
provided for them. API interfaces are characterized by
one or more open/connect procedures, procedure invocation
to specify relevant data, one or more data fetch operations
(e.g., all the data at once or piece at a time), completion and
close operations. The specifics of each API interface need to
be specified by program fragments, though some higher
level specification is possible. The program fragments which
may be needed should be relatively small and confined in
scope, as they would be concerned with the local details of
the API; the parsing and decomposition of the resulting
data would be handled by the system based upon the
HLDSS.

API interfaces may be for local (same process) interfaces
or for external process interfaces via remote procedure call
(RPC). The latter includes interprocess communication in
the same computer as well as inter-system communications
to other computers in both LAN and WAN (e.g. internet)
networks. When RPC is utilized, the data arguments of the
invocation as well as data returned from the call are encoded
automatically by the XDR protocol into a linear stream of
data for transmission over the network (e.g. via TCP/IP).
Thus the arguments and returned values may be complex
data structures, though some restrictions apply.

When complex structured data is returned by an API
interface, the decomposition or parsing of this data structure
need not be done by a specially written program, but can be
uniformly processed by the system, the same as is done for
other data resources. HLDSS specifications can refer to
complex structured data in main memory (e.g. as returned by
an RPC invocation), by specifying the appropriate
annotations.

Basically, the right hand side (RHS) of an HLDSS
production defines named components of the data structure
referenced by the left hand side (LHS). Any of these
components may involve further substructure by specifying
another production with that component on the LHS. Arrays
would be accessed as a sequence of elements whose length
is the product of the dimensions of the whole array or of the
specified array slice.

A linked list would be represented by a production with a
linked list annotation. Trees would be represented by a set of
productions, one for each subtree, and may be recursive,
where the LHS represents a node and the RHS represents the
children. For linked lists and children of a tree node, the
number of components may be unspecified; in this case the
"+" or "*" notation in the HLDSS for the repeating
component designates one or more, and zero or more
occurrences, respectively.

More general pointer-based structures would be represented
by an annotation on each RHS token that represents a
structure that is accessed by dereferencing a pointer; this
per-token 'pointer' annotation is different from the
annotation at the far right side of a production which refers to the
whole production.

Several user interfaces may be used. All are subordinate
to and accessible from the main interface which has been
nicknamed DAISy, for DAtabase Integration System. This is
an X windows interface and it represents the Interoperability
Assistant and Toolkit. It provides a great deal of support for
building and executing information bridges along with other
functionality. Consequently, it is intended to be used by both
the Integration Administrator as well as by general users of
information bridges.

A control file approach is used to store the pertinent
information needed to create and/or execute an information 
bridge. The control file contains information about the 
locations of the three specification files, the location of the 
information bridge, as well as pertinent information relating 
to data sources and data targets. An Integration Administra- 
tor can create a new control file or, in many cases, modify 
a similar control file on-line and save the modified file under 
a new name. 

Two types or classes of users of the system include: 

1) Integration Administrators: these are the systems spe- 
cialists who understand the data requirements in detail 
and use the system to create information bridges. They 
need to have a detailed understanding of the specifica- 
tion language. 

2) Users of the information bridges who wish to (a)
transform data from one or more data sources into
different representations on different systems, and/or
(b) browse the combined data from possibly several
sources independently of these sources and their native
representation(s). These users need not have much
familiarity with the system to use it effectively.

One of the capabilities available to the user is to dynami- 
cally view the progress of execution of an information 
bridge during the data transformation process using a moni- 
tor to show the status of each of the three separate processes: 

1) Accessing the source data and building the input 
instance tree, 

2) Applying the rule transformations, 

3) Building the output instance tree and sending the 
transformed data to its intended destination. 

These three processes are designed to execute asynchro- 
nously. 
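
One way to picture three asynchronously executing stages is a pipeline of workers joined by queues. The sketch below is an assumption-laden illustration using Python threads and the standard `queue` module; the stage functions, the `None` sentinel, and the toy transformation rule are all hypothetical, and the actual system may use separate operating-system processes instead.

```python
import queue
import threading


def access_source(records, out_q):
    for rec in records:              # process 1: build input instances
        out_q.put(rec)
    out_q.put(None)                  # sentinel: no more data


def apply_rules(in_q, out_q, rule):
    while (item := in_q.get()) is not None:   # process 2: transform
        out_q.put(rule(item))
    out_q.put(None)                  # pass the sentinel downstream


def build_output(in_q, sink):
    while (item := in_q.get()) is not None:   # process 3: emit results
        sink.append(item)


q1, q2, sink = queue.Queue(), queue.Queue(), []
threads = [
    threading.Thread(target=access_source, args=([1, 2, 3], q1)),
    threading.Thread(target=apply_rules, args=(q1, q2, lambda x: x * 10)),
    threading.Thread(target=build_output, args=(q2, sink)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each stage starts work as soon as its predecessor produces an item, so a monitor could report the status of all three at once.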

An interactive browser has been developed for viewing
and traversing both the schema and the data instances arising
from the heterogeneous databases and design files. Unlike
the external data viewers above, with the browser of the
present invention the representation of data it presents is
uniform regardless of the kinds of data source(s) involved.
This representation is based on the logical structure diagram
(LSD) formalism described earlier.

The browser provides a homogeneous presentation of
both the schema and data instances from multiple
heterogeneous databases and structured design files. This
represents an important aspect of integration since the user need
not be concerned with the differences between data models, 
databases, and other structured data representations. With 
this feature in an information bridge, the ability to graphi- 
cally represent data in a uniform manner without relying on 
particular data models is provided. 

It should be noted that such browsing may be done on 
either 1) schema and data prior to transformation — so as to 
preserve the original logic of the application; and/or 2) after 
such data transformations are completed — the latter thus 
includes the semantic integration which has been done 
through the transformation rules. The browsing mechanisms 
are the same. The ability to browse the data from either of 
these two viewpoints provides useful flexibility. 

The schema shown in FIG. 3 is displayed by browser
schema 300 in FIG. 5, the browser providing both horizontal
and vertical views for schema and data.

In addition to browsing the full schema and full instance
trees, the user also may control how the schema and instance
trees are displayed. The user may contract specific subtrees
so as to focus on portions of the schema and/or data that are
of interest and remain in view. Such contracted subtrees may
be expanded to their full subtrees at any time during the
browsing, as controlled by the user. The browser also may
be switched between instance and schema displays at will.

The Browser example in FIG. 5 shows that the data values
at the leaves of the schema are displayed. Data is associated
only with the leaves, while the tree structure represents
relationships in the source and/or target databases. Notice
that only one set of related data values is shown at a
time; that is, one data value for each leaf node in the LSD
schema tree.

The set of menus and options includes:

File Menu:
    Select a different LSD schema and data tree to browse,
    Create an additional browser window, or close a
    browser window.
Tree Menu: options apply to both schema browsing and
instance browsing
    Contract Marked Node to replace/elide the subtree
    under that node (nodes are marked utilizing the
    mouse),
    Expand Marked Node one level,
    Expand Marked Node to all levels in its subtree,
    Unmark the marked node.
Browse Menu: options to
    Switch between browsing the schema versus browsing
    the data instances,
    Show Next Instance, Show Previous Instance,
    Show First instance of a subset/subtree,
    Show Last instance of a subset/subtree.
View Menu: option to switch between vertical display
(root at top and leaves at bottom) versus horizontal
display (root at left and leaves at right).

When browsing the data instances, the user may go to the
next or previous instance relative to the current
instance and a designated pivot node. The pivot
serves as the point at which the notion of next and
previous is applied. Thus the Next Instance operation
advances to the next instance of the pivot node, and
in so doing will advance each of the immediate
children of that pivot node, with this effect
propagating down to the leaves. Thus as the data values are
moved through via user commands, some or all leaf
nodes will show their data values changing.

The sequence of values for the different leaf nodes is
defined as a logical 'tuple' since there is a single value in
each position at any one time. That is, a tuple is defined as
the sequence of data values, one per leaf node, that fall under
the tree scope of a selected node, which is referred to as an
'anchor' node.

Using this notion of a logical tuple, a more compact
representation of data values in tabular form, showing
multiple such logical tuples at a time, is provided. To do this,
the user selects a subtree of the LSD tree by selecting as an
anchor node the root of this subtree. The tabular view then
will display in a scrollable subwindow the sequence of
tuples arising from the leaf nodes under the root of the
selected subtree. There will be one tuple for each
combination of leaf node data values.

Another new capability in the data-model-independent
browser is the use of queries to select subsets of the overall
data regardless of their source, and thereby filter out
unrelated information in order to focus on selected data. From
the user's perspective, performing a selective query is
similar to using the tabular browsing functionality just described.
In addition, the user must specify the desired qualifying
condition on a leaf node to utilize this functionality.

To specify a condition, the user selects a leaf node and 
then selects the query parameters option from a query menu. 
A dialog box then allows specification and editing of the
qualifications for the node. 
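
The logical tuple and the qualifying condition on a leaf node can be sketched together. The nested-dictionary layout for instances, the dotted leaf paths, and the function names `logical_tuple` and `tabular_view` are assumptions introduced for illustration; they are not taken from the browser itself.

```python
def logical_tuple(instance, path=()):
    """Return [(leaf_path, value), ...] for one instance under the anchor node."""
    if not isinstance(instance, dict):
        return [(".".join(path), instance)]
    pairs = []
    for name, child in instance.items():  # dicts keep left-to-right order
        pairs.extend(logical_tuple(child, path + (name,)))
    return pairs


def tabular_view(instances, leaf=None, condition=None):
    """Return one row per instance, optionally filtered on a single leaf."""
    rows = []
    for inst in instances:
        row = dict(logical_tuple(inst))
        if leaf is None or condition(row[leaf]):
            rows.append(tuple(row.values()))
    return rows


instances = [
    {"part": {"id": 1, "name": "bolt"}, "qty": 40},
    {"part": {"id": 2, "name": "nut"}, "qty": 5},
]
all_rows = tabular_view(instances)
large_orders = tabular_view(instances, leaf="qty", condition=lambda q: q > 10)
```

Each row holds exactly one value per leaf node, which is what makes the tabular display possible regardless of where the data originated.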

A structure editor for the HLDSS specification has been
developed which interactively ensures that a correct
specification is created. It does this by type checking each token
that the user enters into the editor. It also provides the user
with templates of specification constructs, and it provides
feedback to the user as to the allowed components at any
point in the specification.

Users interact with the HLDSS structure editor using 
commands and transformations. Commands include self- 
inserting characters, character and structure deletion, etc., 
and may be executed through use of the key sequences that
are bound to editor commands or through menus. A
common form of transformation expands a placeholder with
its own template, which is an outline of some syntactic
construct containing placeholders for constituents of this
expanded template.

A semantic data specification language (SEMDAL)
incorporates features for management of metadata for databases,
structured data, document structure, and collections of
multimedia documents as well as web-based information.
Metadata is descriptive information about data instances. This
metadata may provide the schema structure in which the
instance data is stored and/or presented. The metadata also
may provide a wide variety of other information to help
interpret, understand, process, and/or utilize the data
instances. At times, metadata may be treated as if it were
instance data, depending upon the application.

Both Standard Generalized Markup Language (SGML)
and the high level data structure (HLDSS) of the present 
invention are subsumed by SEMDAL. SEMDAL includes 
frame based structures, inheritance, creation of groups of 
frames, etc. A subset of SEMDAL can be transformed into 
the HLDSS format. This enables SEMDAL to be applied to 
database interoperability issues for structured databases 
(relational, object-oriented, hierarchical, etc.) as well as for 
structured files. 

While a subset of SEMDAL translates into the HLDSS, a 
somewhat different but significantly overlapping subset 
translates into SGML and thus offers a different surface 
syntax. Thus, it is the interpretation associated with the 
syntax that is essential. Normally, SGML is interpreted to 
refer to document structuring with tags (markup) and 
HLDSS refers to databases and structured data. However,
with SEMDAL different interpretations may be associated with
a SEMDAL construct to achieve either of these results. This 
facilitates bridging the gap between structured databases and 
free form text and documents. 

It is important to be able to make the interpretation
explicit and to be able to apply alternative interpretations
within a specification language. The interpretation is defined
as the operational processing which is to be applied to
different constructs of the syntactic language, both in terms
of the logical meaning as well as the procedures which are
to be applied for a particular application. The following is an
example of a Frame construct:

<!FRAME manifest :: header bound nodes numNodes;
<!ATT   numNodes    type=integer,
                    value=8 >>

Note that the ATTributes are scoped within that Frame,
that attributes defined in the !ATT portion are semantic
attributes, and that the names on the right side of the "::" are
structural components or structural attributes. Any
representation language includes in its definition some implicit
interpretations. What is important is that the set of
interpretations which are being described is explicit so that
different interpretations can be captured within the same language
by specifying the relevant interpretations rather than needing
different languages for each.

In other words, if an ATTribute is added in the above
Frame with the INTERPretation that this is a 'document'
then the right hand side (after "::") will be interpreted as
components of the document, consisting of a header, a
bound (read as an 'abstract'), textual nodes (e.g.,
paragraphs), and a count of the paragraphs. When this
interpretation is specified, the !Frame construct is very
similar to the !Element construct of SGML.

This is one aspect of how the SEMDAL language is
mapped into SGML. Also, the lexical scoping of the Frame
construct is not directly paralleled by SGML. The Frame
construct allows the same attribute name to be used
differently in different frames due to the lexical scoping and is
similar to an object definition. However, if each attribute
name is used uniformly in all Frames, then the Frames of
SEMDAL can be translated into the !Elements of SGML and
the !ATT of SEMDAL becomes the !ATTList of SGML.

If, however, INTERPretation is specified as a 'relational 
database', then in the previous example Frame, manifest is 
a relation which has fields/attributes for header, bound, 
nodes and numNodes. In fact, what is referred to as the 
'phyla' in the HLDSS for database interoperability is essen- 
tially what is meant here by the interpretation. This is what 
determines whether the HLDSS statement is to be treated as 
a reference to a relation, an object, or part of a parsing 
specification for a structured file. 

Of course certain semantic ATTributes, such as type, will 
have meaning under just one INTERPretation or the other. 
Thus the INTERPretation attribute can be used by a reason- 
ing or processing engine to determine how the other 
attributes and constructs of the specification should be 
interpreted and utilized. 
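
The role of the INTERPretation attribute can be sketched as a dispatch on its value. This is a minimal illustration only: the function name `interpret_frame` and the dictionary shapes it returns are assumptions, and only the two interpretations discussed so far are covered.

```python
def interpret_frame(name, components, interp):
    """Map one Frame onto the structure implied by its INTERP value."""
    if interp == "relational database":
        # The Frame names a relation; structural components become fields.
        return {"relation": name, "fields": list(components)}
    if interp == "document":
        # The Frame names a document element; components become its content.
        return {"element": name, "content": list(components)}
    raise ValueError(f"no interpretation registered for {interp!r}")


components = ["header", "bound", "nodes", "numNodes"]
as_relation = interpret_frame("manifest", components, "relational database")
as_document = interpret_frame("manifest", components, "document")
```

The same `manifest` Frame yields a relation in one case and a document element in the other, with only the interpretation value changing.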

The notion of interpretation enables the use of the same 
representation to express a matrix. This is one of the novel 
aspects of how interpretation is used in the SEMDAL 
language to unify and express seemingly diverse constructs 
such as structured documents, relational and object-oriented 
databases, and matrices. 

In the following example, a 2-dimensional matrix is 
defined with a value that is called a WindVector. The 
structural components of WindVector are defined in a second 
Frame. "MatA" is a matrix because its INTERPretation is 
"matrix". This means that the right most structural compo- 
nent is the value at each entry of the matrix, and the other 
structural components are the dimensions, thus it is a 
2-dimensional matrix. Hence the dimensions and value 
attributes could have been deduced. 



    <!FRAME MatA :: Lat Lon WindVector ;
    <!ATT   MatA INTERP = matrix,
                 dimensions = 2,
                 value = $WindVector;
            (Lat | Lon) Units = degrees ;
            Time Units = seconds > >
    <!FRAME WindVector :: X Y Z Speed ;
    <!ATT   (X | Y | Z) Units = meters;
            Speed Units = meters/second > >



The notation "$" means that the value of WindVector
rather than the string/name 'WindVector' is the value at each
element of the matrix; this value consists of the 4-tuple of
structural components defined in the WindVector Frame.
The notation (Lat | Lon) allows semantic attributes which
are common to multiple components or attributes to be
defined concisely in a single expression; similarly for the
Units of (X | Y | Z). Of course, this example is not the only
way of representing the direction and speed of a wind vector.

Though it is not necessary in this case, the two frames 
above could be grouped together by a grouping clause with 
documentation such as: 



    <!GROUP WindMatrixA
    <!ATT   WindMatrixA Members = (MatA, WindVector)
            WindMatrixA Documentation = "documentation string">>



In view of the above, matrices can be represented in a 
relational database by a relation defined for each matrix and 
having an attribute/field for each of the N dimensions, and 
another attribute for the value — or several attributes if the 
value has substructure. Then each tuple would contain the N 
coordinates followed by the value at that coordinate. This 
could be space efficient for sparse matrices. 
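
The relational representation of a matrix described above can be sketched as follows. This is a minimal illustration only; the function names `matrix_to_relation` and `relation_to_matrix` and the sample data are assumptions made for this example and do not appear in the specification.

```python
from typing import List, Tuple

def matrix_to_relation(dense: List[List[float]],
                       empty: float = 0.0) -> List[Tuple[int, int, float]]:
    """Return one (row, col, value) tuple per non-empty entry of a 2-D matrix."""
    return [
        (i, j, v)
        for i, row in enumerate(dense)
        for j, v in enumerate(row)
        if v != empty
    ]

def relation_to_matrix(tuples: List[Tuple[int, int, float]],
                       shape: Tuple[int, int],
                       empty: float = 0.0) -> List[List[float]]:
    """Rebuild the dense matrix from the coordinate/value tuples."""
    rows, cols = shape
    dense = [[empty] * cols for _ in range(rows)]
    for i, j, v in tuples:
        dense[i][j] = v
    return dense

# Hypothetical sparse 3x3 matrix: only two entries are stored as tuples.
dense = [[0.0, 0.0, 3.5],
         [0.0, 0.0, 0.0],
         [1.2, 0.0, 0.0]]
rel = matrix_to_relation(dense)
```

Only the populated coordinates become tuples, which is why this layout could be space efficient for sparse matrices. A value with substructure, such as WindVector, would simply contribute several value columns per tuple.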

N-ary relationships are also achieved as another interpre- 
tation of a Frame. Note that the !ATTribute specifications 
within a Frame define only ternary relationships between the 
component, the attribute name and the value for that 
attribute. For example, in <!ATT Time Units=seconds> the
component is Time, the attribute is Units, and the value is 
seconds. 

An interpretation of a Frame as a 'relation' provides the
left hand component (before the "::") as the name of the
relation and the right hand components as the components
that participate in the relation. Normally a single construct
in other specification languages could not represent both the
substructure of a document and an n-ary relationship with
the same kind of specification statement, but here this is
accomplished easily through the novel use of an
INTERPretation.

Thus, an n-ary relation may be defined by a Frame as in 
the following example: 



    <!FRAME Supplies :: Supplier PartNo ShopLocation Cost ;
    <!ATT   Supplies INTERP = relation;
            Cost Currency IN (US-dollars | Canadian-dollars),
                 type=integer > >



The semantic attribute INTERPretation as a 'relation'
means that this Frame represents the Supplies relationship
between Supplier, PartNo, ShopLocation and Cost. The
semantic attributes also indicate that Cost is given as whole
dollars (integer) in either US currency or Canadian currency.
In the !ATT clause either "=" is used to set a default value
or an "IN" comparator is used to constrain the value to an
enumerated set of possible values or a range. An attribute also
can be defined without specifying its default value. Other
comparators would be expressed as constraints. If it is desired to
specify a constraint by "IN" as well as to set a default value,
two expressions would be used, for example:

<!ATT Cost Currency IN (US-dollars | Canadian-dollars),
Currency=US-dollars>
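
The combination of an "IN" constraint with an "=" default can be illustrated by the following sketch. The class `SemanticAttribute` and its `resolve` method are hypothetical names introduced only for this example; the specification does not define how a processing engine enforces these clauses.

```python
class SemanticAttribute:
    """A semantic attribute with an optional enumerated set and default."""

    def __init__(self, name, allowed=None, default=None):
        self.name = name
        self.allowed = allowed      # values permitted by an "IN" clause
        self.default = default      # value set by an "=" clause

    def resolve(self, value=None):
        """Return the supplied value, or the default when none is given."""
        chosen = self.default if value is None else value
        if self.allowed is not None and chosen not in self.allowed:
            raise ValueError(f"{self.name}: {chosen!r} not in {sorted(self.allowed)}")
        return chosen

# Mirrors: <!ATT Cost Currency IN (US-dollars | Canadian-dollars), Currency=US-dollars>
currency = SemanticAttribute(
    "Currency",
    allowed={"US-dollars", "Canadian-dollars"},
    default="US-dollars",
)
```

Calling `currency.resolve()` falls back to the default, while a value outside the enumerated set is rejected.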

Note that the above INTERPretation as a 'relation' is not
the same as a relational database, since nothing has yet been
decided about how this 4-ary Supplies relation is to be
materialized and stored. This relationship could be, if one
wanted, represented by Horn logic clauses in a Prolog
system.

Logical variables for both values as well as for names in
SEMDAL are now introduced and it is shown how constraints
can be expressed in terms of these variables. Variables
can be denoted as ?XX where the "?" indicates that XX
is the name of a variable, and XX may be any alphanumeric
string where the leftmost position is alphabetic and capital
letters and lower case letters are treated as different. The
variable may stand for either a value or a name of any term
in SEMDAL, that is, the variable may stand for the name of
a Frame, the name of a structural component (on the right
side of the "::"), the name of a semantic attribute or a
semantic (sub)attribute of an attribute (a slot), etc.

The meaning of the variable is designated by setting it
equal to the components involved. Use of the wildcard "*"
can refer to multiple components, which may be further
restricted if desired by a predicate that the variable also must
satisfy. Thus valid expressions for defining variables include:

1. Supplies.Cost.value=?Cv, which is equivalent to
Supplies.Cost=?Cv.

2. Supplies.Cost.Currency=?CR

3. Supplies.Cost.type=?Ty

4. Supplies.Cost.*=?Cattr

5. Supplies.{*}=?Aset

6. Supplies.*Components=?Scomp

The first expression above defines the variable ?Cv to be
the value of the Cost component in the Supplies frame, while
the second and third expressions define the logical variables
?CR and ?Ty to be the values of the semantic attributes
Currency and type respectively. The fourth expression
defines the variable ?Cattr to be one of the attribute names
in the set of attributes of Cost, while the fifth expression
defines the variable ?Aset to be the set of attribute and
component names of Supplies; the curly braces { } indicate
that the set of attributes rather than one attribute is intended.

The last expression defines ?Scomp to be one of the
immediate structural components of Supplies, that is, one of
the components defined on the right side of the "::", namely
Supplier, PartNo, ShopLocation, and Cost. Note that the
reserved expression *Components refers to any one of the
structural components of a frame, whereas the "*" in
"FrameName.*" refers to any of the attributes of the frame, be
they structural or semantic attributes. The notation
{*Component} would refer to the set of structural components.
Similarly *Semantic refers to any one semantic attribute and
{*Semantic} refers to the set of semantic attributes. More
generally, predicates in the form of additional constraints
can be used to limit the set of possible values for a variable.
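
How a wildcard expression determines the candidates for a variable can be sketched as below. The dictionary layout of the frame and the function name `domain` are assumptions made for illustration; the specification does not prescribe any in-memory form.

```python
# Hypothetical in-memory form of the Supplies frame from the example above.
supplies = {
    "components": ["Supplier", "PartNo", "ShopLocation", "Cost"],
    "semantic": {"INTERP": "relation"},
    "attributes": {"Cost": {"Currency": "US-dollars", "type": "integer"}},
}

def domain(frame, pattern):
    """Return the list of candidates a variable may be bound to.

    "*Components"  -> any one structural component
    "*Semantic"    -> any one semantic attribute of the frame
    "*"            -> any one attribute, structural or semantic
    "{...}"        -> the whole set, offered as a single candidate
    "Name.*"       -> any one attribute name of component Name
    """
    comps = list(frame["components"])
    sems = list(frame["semantic"])
    table = {"*Components": comps, "*Semantic": sems, "*": comps + sems}
    if pattern.startswith("{") and pattern.endswith("}"):
        # Curly braces: the set itself is intended, not one member of it.
        return [frozenset(domain(frame, pattern[1:-1]))]
    if pattern in table:
        return table[pattern]
    if pattern.endswith(".*"):
        return list(frame["attributes"][pattern[:-2]])
    raise ValueError(f"unsupported pattern: {pattern}")
```

Under these assumptions, `domain(supplies, "*Components")` yields the four structural components one at a time, whereas `domain(supplies, "{*}")` yields a single candidate that is the full set of names.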

Constraints are a very important part of metadata for the 
following reasons: 

1. Constraints express logical consistency among different 
schematic constructs and definitions when they are 
interrelated in the real world. 

2. Constraints provide an important means of expressing
semantics involving multiple objects.

3. Constraints express the requirements for data consistency
within complex databases and between different
sites that contain interrelated data values.

4. Constraints can express conditions and predicates 
which may signal important events. 

5. Constraints are used to make the active metadata 
repository respond to changes and events and to initiate 
actions dynamically. 

6. Constraints go beyond the trigger mechanisms that 
have been and are being introduced into commercial 
database systems. 

All constraints and expressions involving variables and/or
involving operators other than "=" and "IN" are expressed by
separate <!Constraints ...> statements, each of which is
referred to as a constraint set. This will lead to a
'constraint-rule' package that will be developed.

The !Constraints set can be lexically within a Frame, or
can be in a !Group clause, or can stand alone outside of a
Frame; the latter is appropriate when constraints interrelate
components of two or more frames. The Constraints set is
named and may contain multiple constraints, separated from
one another and ended by the closing ">".

The six expressions listed above which define logical
variables can be seen as constraints, since the variables are
constrained relative to values and/or names of components
and attributes. Since some variables may refer to names and
other variables may refer to values, one can have expressions
such as

<!Constraints Cst1 ?Cattr.value=?Cv>
which defines a constraint set named Cst1 and means that the
variable ?Cv can stand for (be bound to) any of the values
of those attributes whose name can occur for the variable
?Cattr. This constraint involves two variables. Other
constraints may further limit the actual choice.

More generally, constraints do not need to involve vari- 
ables. For example, the following constraint statement: 

<!Constraints Cst2 Supplies.Cost<=Supplies.CeilingPrice
Supplier.KindOf=SmallBusiness>
constrains the Cost of a supply item to be less than or equal
to the CeilingPrice of that supply item, and it constrains the
kind of Supplier to be a small business.

The dotted expressions, such as Supplies.Cost, are path 
expressions, and the value of the component on the right- 
most side of that path is the value which is being con- 
strained. A suffix of ".value" could be added to the right side 
of each path for the same effect. 

The way variables may be defined for constraints may 
also be generalized. While the six constraint expressions 
above defined each variable in terms of its allowed set of 
values, another form of constraint makes the value set 
implicit in terms of how the variable can be bound, thus a 
constraint such as: 

<!Constraints Cst3 Supplies.?X.value=?Y>
binds variable ?X to any component or attribute of Supplies 
which has a value, and it binds ?Y to the value of that 
component. 
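
The binding behaviour of a constraint such as Cst3 can be sketched as follows. The sample values and the function name `bind_cst3` are hypothetical; the sketch assumes that a component "has a value" when its entry is not None, and an optional predicate stands in for any further constraint that limits the choice.

```python
# Hypothetical instance values for the components of Supplies.
supplies_values = {
    "Supplier": "Acme",
    "PartNo": "P-17",
    "ShopLocation": None,   # no value, so ?X cannot be bound to it
    "Cost": 120,
}

def bind_cst3(frame_values, predicate=None):
    """Enumerate bindings for Supplies.?X.value=?Y.

    ?X ranges over every component that has a value and ?Y is bound to
    that value. The predicate, if given, further restricts the bindings.
    """
    bindings = []
    for name, value in frame_values.items():
        if value is None:
            continue
        binding = {"?X": name, "?Y": value}
        if predicate is None or predicate(binding):
            bindings.append(binding)
    return bindings

all_bindings = bind_cst3(supplies_values)
numeric_only = bind_cst3(supplies_values, lambda b: isinstance(b["?Y"], int))
```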

The translation of SEMDAL to a logic language such as 
KIF (Knowledge Interchange Format) has yet to be devel- 
oped for reasoning and consistency analysis of a set of 
semantic constraints. The notion of constraint packages with 
enforcement actions to maintain data consistency among 
distributed heterogeneous databases also needs to be further 
developed. 

One of the significant advances made by the present 
invention is the seamless ability to store diverse forms of 
metadata in a single repository utilizing a single logical and 
physical storage scheme. The differences between metadata 
representations as currently used are accounted for via the 
operational interpretation of the uniformly stored informa- 
tion. Furthermore, alternative syntactic realizations of this 
metadata are provided so as to continue to support those 
applications which expect different syntax for different uses. 

The logical design level of the metadata repository allows 
for multiple implementations and is convenient in different 
storage architectures. The physical storage of one imple- 
mentation is relation-based in order to be accessible with the 
Java Database Connectivity (JDBC) standard for the Java 
interfaces to relational databases. Note that the choice of 
repository implementation is independent from the data 
models being described. 

The metadata repository of the present invention encom- 
passes both structure and semantics, and it does so in a way 
that can accommodate separately developed and different 
metadata within a primary framework. 




The main structural representation focuses on hierarchical
tree-structures as the dominant structuring mechanism. It
also accommodates full directed graph structures through
cross-referencing among nodes in the spanning tree of the
graph.

The semantic representation consists of three levels:

  Semantic Frames, attributes, and subattributes or slots,

  Reference to agreed upon semantic features and values to
  take advantage of existing terminology and ontologies,
  and

  Use of logic-based and rule-based expression of semantics
  to supplement agreed upon declarative semantics
  with areas where standards are not yet evident, and to
  provide executable definition of such semantics.

At the structural level, the primary characteristics of
SGML (Standard Generalized Markup Language), HTML
(Hypertext Markup Language), and the new evolving XML
(Extensible Markup Language) have been subsumed, as well
as the heterogeneous database structures, including
relational and object-oriented models, while providing
extensibility to address multimedia data.

Although the existing representations for structured
documents, databases, and semantic knowledge representation
are rather different on the surface, unification at the
logical level has been achieved. The common framework,
then, admits alternative syntactic presentations, the need
for which has heretofore accounted for the superficially
large differences between different specification languages.
Furthermore, this syntactic diversity has been capitalized
on to admit yet other alternative syntactic expressions for
both output as well as input, thereby enabling the development
of specialized mini-languages for particular applications
while retaining a common underlying semantics and
uniform internal representation. Such syntactic diversity has
been accommodated by providing for an input parser and an
output formatter to realize each such specialized
mini-language. Both the parser and formatter communicate with
the common logical representation.

It is tempting to refer to this logical representation as the
'internal' representation, but in fact this single logical
representation could have its physical storage implementation
in a relational database, or an object-database, or in a
hierarchical or network database architecture just as easily.
A relational storage implementation has been chosen
because of its accessibility through the increasingly popular
JDBC application programming interface.

The metadata representation of the present invention shall
be referred to as the Semantic Metadata Description
Language, or SEMDAL, and is also referred to as MDS
(MetaData Specification) for short. It is a frame-based
representation where a MetaFrame contains potentially multiple
attribute value specifications. Each attribute may have
subattributes, and these too may have subattributes if
needed. This creates a potentially hierarchical array of
descriptors. A simple example of such descriptors is 'units'
and 'precision' information for length, weight, or other
measures.

The potential hierarchy of semantic descriptors, and the
actual substructure of the stored data are distinguished as
described by the MetaFrame's metadata. Such data substructure
is explicit as in SGML, and is denoted by a 'Substructure'
attribute (similar to SGML's '!Element'), or by using
a syntactic shorthand similar to BNF (Backus-Naur Form)
grammar specifications.

The shorthand: Aggregate :: component1, component2, ...,
stands for: Aggregate.substructure=component1,
component2, .... The right hand side admits alternation
("|" representing "or") between subsequences of terms. In
this case, the right side must consist only of one level of
alternation of subsequences, and thus parenthesization is not
used.

Suffixes "?" for zero or one, "*" for zero or more, and "+"
for one or more are utilized. This is very similar to SGML's
!Element construct when used to define non-terminals. For
some component or aggregate A, the semantic attribute
A.tagged="start end" or "start" is used to represent SGML's
tag minimization scheme of "- -" for start and end tags being
required, "-O" for start tag required but end tag optional. The
default tagging is that neither start nor end tag is required; it
is thus necessary that the target syntax be parsable
unambiguously for it to be acceptable.
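
The grammar-like shorthand with occurrence suffixes and a single level of alternation can be parsed along the following lines. This is a sketch only; the `Term` record, its field names, and the function `parse_production` are hypothetical and are not taken from the specification.

```python
import re
from typing import List, NamedTuple

class Term(NamedTuple):
    lhs: str          # aggregate on the left of "::"
    rhs: str          # one right hand side component
    occurrence: str   # "?", "*", "+", or "" for exactly one
    order: int        # position within its alternative, starting at 1
    alternative: int  # which "|"-separated alternative, starting at 1

def parse_production(line: str) -> List[Term]:
    """Split 'LHS :: a+ b* c | d' into one Term per right hand side component."""
    lhs, rhs = line.rstrip(" ;").split("::", 1)
    lhs = lhs.strip()
    terms = []
    for alt_no, alt in enumerate(rhs.split("|"), start=1):
        for order, token in enumerate(alt.split(), start=1):
            # Peel an optional trailing suffix off the component name.
            m = re.fullmatch(r"(.+?)([?*+]?)", token)
            terms.append(Term(lhs, m.group(1), m.group(2), order, alt_no))
    return terms

# Hypothetical production with all three suffix kinds and one alternation.
terms = parse_production("Aggregate :: first+ second* third? | other ;")
```

Because only one level of alternation is admitted, splitting on "|" without handling parentheses is sufficient in this sketch.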

For tree structured (and of course flat) data, this grammar- 
like representation for substructure is quite appropriate. For 
full graph-structured data, such as may occur in object- 
models and in the network data model, a node may have 
multiple predecessors — that is, a node may participate in 
multiple aggregations or collections. If a copy of the node is 
not sufficient because the shared aspect must be captured, a 
cross-reference of the form #node is used to create the 
multiple references to the common node — this is analogous 
to the use of #tag references in HTML. So for example, if
both B and D are to represent the same identical structure,
one could write:

A :: B C D(=#B)

where "#B" represents the identity of "B". An equivalent
form of expressing this is:
D.substructure.!ID=B.substructure.!ID,
where "!ID" is the system's identifier for the substructure.
Simply to initialize some component E to be a copy of B's
substructure, one could write:
E.substructure=COPY[B.substructure],
where COPY is a structure copy operation on the value of
B.substructure.

Thus for graph-structured data, the spanning tree could be 
created, and then cross-references to correctly represent the 
shared nature of multiple references to a node could be done. 
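
The difference between a cross-reference and a structure copy can be illustrated with a small sketch. The `Node` class is a hypothetical stand-in for a component with substructure, and Python object identity is used here in place of the system identifier "!ID"; neither is prescribed by the specification.

```python
import copy

class Node:
    """A component whose substructure is a list of child nodes."""

    def __init__(self, name, substructure=None):
        self.name = name
        self.substructure = substructure if substructure is not None else []

b = Node("B", [Node("leaf")])
c = Node("C")

# D(=#B): D refers to the very same substructure as B (shared identity).
d = Node("D")
d.substructure = b.substructure

# E.substructure=COPY[B.substructure]: E starts as an independent copy.
e = Node("E", copy.deepcopy(b.substructure))

a = Node("A", [b, c, d])    # spanning tree: A :: B C D

# A later change to B's substructure is seen through D but not through E.
b.substructure.append(Node("extra"))
```

After the final statement, D reflects the added child because it shares the node, while E retains only what was copied at initialization.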

The underlying logical model for the metadata repository 
thus is a hierarchical structural model that can represent the 
spanning tree over a graph or directed graph structure. The 
logical model for semantic attributes provides a hierarchy of 
subordinate descriptors — this semantic hierarchy is separate 
from the structural hierarchy. Each semantic descriptor can 
be represented as a whole, with operations to retrieve the 
associated value, the parent descriptor which is refined by 
this subattribute, and the set of subattributes of the current 
descriptor. 

These features of the logical model can be represented and
stored in an object-database or a nested-relational database
(there are few nested relational databases). They can be
conveniently represented in a traditional relational database
through the use of keys which reference the subordinate
hierarchical levels. Similarly, the structural hierarchy is
reflected by tuples in a different relation, with similar ability
to find the parent aggregate and/or the component children
of a non-terminal node. Thus each level of the semantic
hierarchy and the structural hierarchy is represented by
tuples in relational tables, and references to subordinate (or
parent) levels are effected through the relational join
operation.
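
Storing a descriptor hierarchy as tuples and navigating it with a join can be sketched as below. Python's built-in sqlite3 module is used here purely as a convenient stand-in for a relational store; the table name `descriptor`, its columns, and the sample rows are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE descriptor ("
    " path TEXT PRIMARY KEY,"   # full path of this descriptor
    " parent_path TEXT,"        # key of the descriptor it refines, or NULL
    " value TEXT)"
)
conn.executemany(
    "INSERT INTO descriptor VALUES (?, ?, ?)",
    [
        ("Length", None, None),
        ("Length.Units", "Length", "cm"),
        ("Length.Precision", "Length", "0.1"),
        ("Length.Units.System", "Length.Units", "metric"),
    ],
)

def children(path):
    """Subattributes of a descriptor, found by joining parent to child rows."""
    cur = conn.execute(
        "SELECT c.path FROM descriptor p "
        "JOIN descriptor c ON c.parent_path = p.path "
        "WHERE p.path = ? ORDER BY c.path",
        (path,),
    )
    return [row[0] for row in cur]

def parent(path):
    """The descriptor that this subattribute refines, or None at top level."""
    row = conn.execute(
        "SELECT parent_path FROM descriptor WHERE path = ?", (path,)
    ).fetchone()
    return row[0] if row else None
```

The same pattern would apply to a separate relation for the structural hierarchy, with the parent aggregate and component children located by the corresponding join.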

The basic group is a MetaFrame, which is named. The
MetaFrame is started with "<MetaFrame"; every completed
line ends with a semi-colon ";", except that this is optional for
the last line. The MetaFrame is closed with "/>" or,
optionally, with "/MetaFrame>".

MetaFrames are not syntactically nested. A MetaFrame
contains semantic attributes and sub-attributes (slots) which
may be nested arbitrarily deep. A MetaFrame also contains
structural components which may themselves be aggregates
of other components.

Thus a Clock-Schema MetaFrame would be written:

    <MetaFrame Clock-Schema ;
    Clock :: Face Hands ;
    Hands :: HourHand MinuteHand SecondHand ;
    /MetaFrame>



The representation for an analog watch then can refer to
Clock-Schema.SecondHand.Length.Units to define the units
in which the length of the second hand will be expressed.
Then a simple number in the actual data representation will
have meaning that is explicit. Such units are semantic
information and are part of the metadata that is managed, and
they are necessary information in order to properly utilize the
actual length data values, which may be stored in a separate
database of instance data.

Note that the abbreviation used above "::" represents the
'Substructure' attribute, and this is a syntactically
convenient way to express:

Clock.Substructure=Face Hands

Note also that multiple values for an attribute are
allowed, where ordering is important.

Uniformity of representation in SEMDAL is achieved by
recognizing that seemingly different data models and
representation formalisms differ mostly in terms of how they
represent information, and, in fact, there is much overlap
between data models. The superficial differences often
involve implicit assumptions in a data model, such as the
implications of syntax. These assumptions are made explicit
in the SEMDAL model.

For example, if the following is written:

S1 [A B C D]

it could be defining:

1) a relational table called S1, or

2) a sequence of A ... D, and naming that sequence S1,
or

3) a set S1 of elements A ... D where order does not
matter, or

4) an object class S1 with data members A ... D, or

5) a four dimensional array called S1.

Each of these is referred to as a different interpretation of
the same syntactic representation. SEMDAL provides a
system level attribute 'INTERP' to capture this
INTERPretation, indicating, for example, whether the data
is from a relational database table or from the data members
of an object.

Note that hierarchical structure of the data is defined using
"Substructure" attributes or its abbreviation "::". In
contrast, subattributes, such as "units", do not define
substructure of the data; rather these subattributes provide
additional metadata to help interpret the data which is
presented.

In the semantic attribute specification above for units, the
expression "Clock-Schema.SecondHand.Length.Units" is
an attribute path expression and the 'dot' separates each
'attribute term' of the path expression. Thus
"Clock-Schema", "SecondHand", "Length", and "Units" are each
attribute terms. Such a term can be the MetaFrame name
(leftmost term if present), a structural component, or semantic
attribute or subattribute (slot) to any depth.
To indicate that all "Units" are in "cm" centimeters, a
pattern-based attribute expression may be used to indicate
this as:

*.Units="cm"

The "*" here refers to an arbitrary number of higher level
terms in the path expression, and says that for any components
or attributes where Units is appropriate, the Units to be
used are centimeters. Other path expression patterns include:

Foo.Bar.*.Units, to indicate only Units of attributes at any
depth under Foo.Bar

Foo.?.Units, to indicate only Units one level under Foo.

If an attribute or component is to have several attributes
specified, say Length and Width, this may be abbreviated as
follows, where the alternatives are enclosed in parentheses
and are separated by "|", the alternation symbol:

HourHand.(Length | Width).Units="cm" which means

HourHand.Length.Units="cm"
HourHand.Width.Units="cm"

Beginning a line with a dot is a form of elision, which
refers to and includes the parent attribute path that appeared
on the previous line, that is, all attribute terms except the
rightmost term. As an example, this defines values for
Length and Width using "." elision:

HourHand.Length = 3.5
.Width = 8

In general, when multiple attributes refer to the same prefix
of an attribute path expression, this may be written:

Name.(Type="string", length=40, default="Omega");

Note that for rightmost terms in an attribute path, a comma
separates each attribute-value pair.

Similarly, if the same attributes apply to multiple
components, this may be abbreviated as:

(HourHand | MinuteHand).(Length.Units="inch",
Width.Units="mm").

Here both the HourHand and the MinuteHand are being
assigned semantic attributes as to the units for Length and
Width. Note that only one alternation set (separator "|") may
appear in an attribute path expression, and only one attribute
value list (comma separator ",") can occur in a path; both
can occur in the same path.

Data typing is indicated by the A.type attribute, for some
attribute path A. Also, A.default specifies a default value,
and "A.value=" or just "A=" indicates an (initial) value. A list
of tokens may be given for a semantic attribute, thereby
creating an ordered multi-valued result. Each value token
also may be treated as a semantic attribute which can have
further attributes. So for example below, "Methods" has a
pair of method names as its value, and here one of these
values, "Rotate", itself has the semantic attribute of "Args":

Object.Methods=Rotate Zoom;
Object.Rotate.Args=HAngle VAngle;

indicates that Rotate and Zoom are Methods of Object, and
that the Arguments of Rotate are HAngle and VAngle. One
could instead have written "Object.Methods.Rotate.Args=
HAngle VAngle" to make "Methods" explicit, but only one
alternative (i.e., with or without "Methods") can be used
within a given specification in order to provide consistency.
One may then refer to "Object.Rotate.VAngle", etc. Note
that the system attribute !Multiplicity automatically will be
given the appropriate value (see page 10), so
Object.Rotate.Args.!Multiplicity will yield the value 2.

In the following example, both an object database (ODB)
and a relational database (RDB) as well as a photo are
represented. The name of the MetaFrame is MDS1; often
the name of the frame may be given to be the same as the
first leftmost component (here 'Manifest'). Note that "::" is
an abbreviation for substructure.

    <MetaFrame MDS1 ;
    Manifest :: header+ bound* Photo ;
    header :: Name Version Type Fmt ;
    bound :: numElements numNodes scaleX scaleY ;
    bound.INTERP = "ODB" ;
    bound.System = "Ontos" ;
    bound.OSQL = "Select Mesh.bound ..." ;
    header.INTERP = "RDB" ;
    .System = "@dbioo.ram:42" ;
    .Access = "Fred" ;
    (Name|Version).(Type="string", Length=40) ;
    Version.Default = "1.0" ;
    Photo.(Type="Image", Encode="JPEG", Size=480, Location="NYC",
      Title="Metadata Architecture...", Date="6/5/97", SequenceNo=34,
      Size.Units="KB") ;
    /MetaFrame>

The ODB above has substructure consisting of a set of
header information; the "+" indicates one or more occurrences
of this information, similarly for bound. Thus the
object database contains a set of bound information, while
the relational database contains a table with all the header
information.

In the list of semantic attributes, the dot before System and
Access indicates that "header" prefixes each of these, taken
from the component on the previous line. The Type and
Length information are the same for Name and Version. Note
that the alternation symbol "|" is used between attribute
names that constitute different attribute path expressions.
When, as in SGML, multiple attribute values are assigned, as
for Type and Length, a comma is used as the separator to
denote a list of attribute-value pairs, as for the multiple
attribute values for Photo.

Note that the Size attribute for Photo has a semantic
subattribute indicating that the size is in kilobytes. In
general, any attribute may have subattributes, and these in
turn may also have subattributes, providing considerable
extensibility as more of the implied knowledge of an
application is made explicit.

Attributes such as INTERPretation are system defined
attributes. In order to avoid name conflicts as the application
designer creates attribute names, system attributes may be
prefixed with the "!" exclamation mark, as "!INTERP".

The logical structure described for the Metadata repository
is hierarchical, which could be naturally implemented
in a 'nested relational' database or an object database, as
well as in other databases. Due to the increasing value of the
JDBC connectivity between Java programs and relational
databases, a relational implementation has been chosen.
Thus the Metadata Repository will be accessible via the
JDBC standard API. The metadata which are stored describe
related instance data, which usually is stored separately in
one or more repositories, such as in a relational or object
database, document repository, or multimedia digital library.

Each line of the MetaFrame needs to be represented, and
nested subattributes as well as explicit substructure need to
be provided for. A relational table MetaFrameAttributes is
defined to represent the primary semantic information from
each attribute line of the MetaFrame. The columns of
MetaFrameAttributes are:

MetaFrameAttributes
[FrameName AttributePath Value ParentAttributePath
LocalAttribute DataType Multiplicity]
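
The abbreviation rules for attribute path expressions (one alternation set with "|" and one attribute-value list with ",") can be expanded into full paths along the following lines. This is an illustrative sketch; the helper names `split_top` and `expand` are hypothetical, and the sketch assumes that quoted values contain no commas or "|" characters.

```python
from itertools import product

def split_top(text, sep):
    """Split text on sep, ignoring separators inside parentheses or quotes."""
    parts, depth, quoted, cur = [], 0, False, []
    for ch in text:
        if ch == '"':
            quoted = not quoted
        elif not quoted and ch == "(":
            depth += 1
        elif not quoted and ch == ")":
            depth -= 1
        if ch == sep and depth == 0 and not quoted:
            parts.append("".join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur).strip())
    return parts

def expand(expr):
    """Expand an abbreviated path expression into (full path, value) pairs."""
    expr = expr.strip().rstrip(";").strip()
    choices = []
    for seg in split_top(expr, "."):
        if seg.startswith("(") and seg.endswith(")"):
            inner = seg[1:-1]
            # "|" marks an alternation set; otherwise it is a value list.
            sep = "|" if "|" in inner else ","
            choices.append(split_top(inner, sep))
        else:
            choices.append([seg])
    pairs = []
    for combo in product(*choices):
        path, _, value = ".".join(combo).partition("=")
        pairs.append((path.strip(), value.strip().strip('"')))
    return pairs

hands = expand('(HourHand | MinuteHand).(Length.Units="inch", Width.Units="mm")')
```

Each expanded pair corresponds to one fully spelled-out attribute line, which is the form in which a row could then be entered into a table such as MetaFrameAttributes.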
The full attribute path includes the Frame name; if the
Frame is unnamed it is given a default name of "!Unnamed".
For the purpose of storage, the FrameName is listed in a
separate column for indexing, thus the AttributePath in the
MetaFrameAttributes relation does not include the
FrameName.

Each of the attribute components which are bounded by a
'dot' delimiter on one or both sides is referred to as an
'attribute term'. Thus the attribute terms in
"Auto.Engine.NumCylinders" are "Auto", "Engine", and
"NumCylinders". The AttributePath includes the attribute terms
from left to right as a single string, including the 'dot'
delimiters. The LocalAttribute is the rightmost attribute term
in an attribute path. The ParentAttributePath is formed by
removing the LocalAttribute from the AttributePath. For
top level elements, the Parent is shown as null, which means
that the Frame is the parent. Several of the entries have been
expressed separately for indexing purposes, so that one can
easily find all components of a Frame, all children of an
attribute path, the parent of a path, and all local attribute
names.

The DataType is the data type for the value of this
semantic attribute. The DataType may be specified in the
MetaFrame explicitly as in <attribute-path>.Type or it will
be deduced (guessed) by the data entry subsystem; when it
is explicit, there will be MetaFrameAttributes tuple for the
.Type. The value is stored as a string, and the DataType
information is used to coerce this string value to the
appropriate language type as needed.

Multiplicity is set to 'one' if the value consists of a single
token or if double quote marks are around the whole value,
thereby making it a single string. Otherwise, if a subattribute
"ValueSeparator" exists, it is used to parse what is given for
the value into the actual multiple values. Else whitespace is
used as the separator to parse the value and determine the
Multiplicity. Upon retrieval of this information, the
subattribute !Multiplicity (where "!" is required) is materialized
from this Multiplicity field, rather than being stored and
retrieved as a separate attribute path tuple. System attributes
of the form "!1" and "!<integer>" are also provided, where
the "!" is required, to reference the first through nth value in
a multi-valued result; giving n > !Multiplicity is an error.

If elision were used in the initial representation, the
elision is expanded initially and the full attribute path
expressions are utilized for entry into this relation. The
entries in all Frames without names are treated as if they
were in a single unnamed Frame. Attribute paths for other
named Frames may be specified in an unnamed Frame.

The substructure portion of a MetaFrame consists of those
lines, typically placed first in the MetaFrame, which have
"::" following a token, or an attribute path ending in
substructure; these are often referred to as structure
productions. To represent this information in a relational
database, a relational table MetaFrameStructure is created:

MetaFrameStructure [FrameName LHS RHS Occurrence
Order Alternative]

where LHS is the left hand side of the "::" or the parent
attribute path of the substructure, and RHS is one of the right
side components. There is one such tuple for each right
hand side component. Order is the ordinal position, starting
at one, and taken from left to right, of this RHS component.
Occurrence is the occurrence flag for that RHS: either "?",
"*", "+", or blank, representing respectively zero or one, zero
or more, one or more, or single occurrence. The first three
columns of the MetaFrameStructure table are separately
indexed.

The Alternative field defaults to "1" unless this structure
production contains two or more alternatives, indicated by
"|" to separate each alternative. In this case, the production
must consist only of alternatives at a single level, and thus
with no parentheses. All terms within the first alternative
would have "1" in this Alternative field, with the Order
starting from one for the leftmost term; the second alternative
would have "2" in this Alternative field, with the Order
again starting from one for the leftmost term in this
alternative. If there are multiple productions with the same left
hand side, they are stored the same way as in the "|" case
with the Alternative number determined by the order of
presentation.

Note that all RHS's which do not appear as a LHS are
'terminals' in the sense of the substructure 'grammar'. All
RHS's which do appear as a LHS are 'non-terminals' and do
not generally correspond to individual data values in the
database being described, but rather are placeholders for the
substructure.

A structure path expression may begin with a FrameName
and consists of dot-separated terms, each term being a LHS
or RHS, such that a RHS must correspond to that LHS. If
this LHS also appears as a RHS, the structure path may
continue with a corresponding LHS. Thus in the MDS1
example on page 8, the structure path
"Manifest.bound.scaleX" could be used to refer to this structure,
the associated value, as well as other semantic attributes of
scaleX. Also, a structure path expression ending in a
non-terminal can be used to refer to the data structure instance(s)
corresponding to this non-terminal, each instance itself
being a data structure consisting of the multiple data values
Order of entries is not important within a Frame, though by aad their associated structure, as described by the 
convention all substructure information precedes semantic 50 MetaFrame productions. Structure path expressions can be 
attribute information used m a S eneric uniform query language to refer to data m 

Wildcard expressions in an attribute path include special an V form of structure or database, based upon the metadata 

symbols such as " for zero or more attribute terms, or m me re P° sltor y- 

for one attribute term. These wildcard attribute path expres- In some systems, such as relational databases, the basic 

sions are entered directly in the table as is, and also are 55 M& definitions and some related information are stored in 

expanded for all existing attribute paths, with each expan- ' svstem catalogs' which themselves are relational tables 

sion entered separately as well. When new attribute paths are which niay be accessed by the user in readonly mode. All 

entered, they will be checked relative to the existing wild- such data mav * accommodated in the structure as shown 

card expressions. When new wildcard expressions are so far. An example of this land of catalog information, taken 

entered, they will be checked relative to all existing attribute 60 from Informix, is: 

paths. Wildcard expressions involving"*" may be qualified, SYSTables [tabid, tabname owner rowsize ncols 

by naming the as in "*x", and including a qualifying nindexes . . . ] 

expression, as in Car.*x (|x |<3, x l="Level?") SYSColumns [tabid colno colname coltype collength ... ] 

.NumCylinders. The qualifying expression, within the "( These are represented as the following MetaFrames: 

says that up to 3 attribute terms may occur, and that the 65 <MetaFrame SYSCatalogs; 

first/leftmost must match the given string pattern, i.e., must SYSTables :: tabid tabname owner rowsize 

begin with "Level" followed by one more character.2 nindexes . . . ; 
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tabid colno 
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colname coltype 



SYSColumns 
collength . . 
(tabid | tabname | owner) .type»"string"; 
(rowsizc | nindexes).type="integer"; 
(colname | coltype). type ="string"; 
(colno | collength). type- u tnteger"; 
/> 
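
The storage rules for structure productions (Order, Alternative, and Occurrence) can be illustrated with a minimal Python sketch. The column names, the `StructureRow` type, the function names, and the `Shape` production are all assumptions made for illustration; the text describes the fields but gives no implementation.

```python
from typing import List, NamedTuple


class StructureRow(NamedTuple):
    """One hypothetical MetaFrameStructure tuple (column names are assumed)."""
    frame: str
    lhs: str
    rhs: str
    order: int        # position within its alternative, starting at one
    alternative: int  # 1 unless the production uses "|"
    occurrence: str   # "*", "+", "?" or "" for a single occurrence


def production_to_rows(frame: str, production: str) -> List[StructureRow]:
    """Split 'LHS :: term term | term ;' into one row per right-hand-side term."""
    lhs, rhs_text = production.rstrip("; \t").split("::", 1)
    rows = []
    for alt_no, alt in enumerate(rhs_text.split("|"), start=1):
        # Order restarts at one for the leftmost term of each alternative.
        for order, term in enumerate(alt.split(), start=1):
            flag = term[-1] if term[-1] in "*+?" else ""
            name = term[:-1] if flag else term
            rows.append(StructureRow(frame, lhs.strip(), name, order, alt_no, flag))
    return rows


def terminals(rows: List[StructureRow]) -> List[str]:
    """RHS names that never appear as a LHS are the grammar's terminals."""
    lhs_names = {row.lhs for row in rows}
    return sorted({row.rhs for row in rows if row.rhs not in lhs_names})


# Hypothetical production with two alternatives and one optional term.
rows = production_to_rows("Demo", "Shape :: Circle Label? | Square ;")
for row in rows:
    print(row)
```

Each alternative gets its own Alternative number while the Order restarts at one, which mirrors the rule stated for productions containing "|".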

Sometimes the descriptive metadata is of the same nature 
for each entry. For example, the metadata for a relational 
database usually will consist of the same kinds of descriptive 
attributes for each relational table and for each column, just 
with different values. In these cases a more compact
representation as a table of Metadata values is possible, though
doing so imparts no additional information. That is, if the kinds
of semantic attributes are the same for each, the same
attribute subpath does not need to be repeated for each
different table and value. Rather, each attribute subpath
could be taken as a column name in a meta-table to provide
greater conciseness.
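
The compact meta-table idea can be sketched in Python as a pivot from individual (attribute path, value) tuples to one row per entry with one column per attribute subpath. The function name `to_meta_table` and the `description` attribute are hypothetical; only the `type` values are taken from the catalog example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def to_meta_table(tuples: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, str]]:
    """Pivot (attribute path, value) pairs into {entry: {subpath: value}}."""
    table: Dict[str, Dict[str, str]] = defaultdict(dict)
    for path, value in tuples:
        entry, subpath = path.split(".", 1)  # the first term names the entry
        table[entry][subpath] = value        # the subpath acts as a column name
    return dict(table)


# The "description" subpath is an illustrative assumption, not from the text.
attribute_tuples = [
    ("tabid.type", "string"),
    ("tabid.description", "table identifier"),
    ("rowsize.type", "integer"),
    ("rowsize.description", "row size"),
]

meta_table = to_meta_table(attribute_tuples)
for entry, columns in meta_table.items():
    print(entry, columns)
```

The pivoted form carries the same information as the tuples; it only avoids repeating the subpath names for every entry.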

It should be noted that the metadata framework can be
used to describe the structure and semantics of the
MetaFrame representation as well as the metadata
implementation vehicles of MetaFrameAttributes and
MetaFrameStructure tables. Thus the metadata repository is
self-describing. This observation helps to address notions of
"meta-meta-data". Specifically, since the metadata
framework developed is self-describing, there is no need for
further levels of "meta-meta" representation formalisms.

XML is an emerging language called the Extensible 
Markup Language. It is intended to extend and eventually 
supersede HTML (Hypertext Markup Language), and to be 
in the spirit of SGML (Standard Generalized Markup 
Language) but to be substantially simpler than SGML. 

When the SEMDAL language is compared with XML it
can be shown that: 1) XML capabilities are subsumed, 2) 
SEMDAL is much more concise, and 3) semantic represen- 
tational capabilities are provided that go beyond the natural 
abilities of XML. 

First the SEMDAL representation is presented, followed
by the XML representation given by XML advocates. The
example is that of a bookstore order system, which sells
books, records, and coffee. The example comes from XML
literature. The example is re-expressed in SEMDAL
and meaningful semantic attribute information has
been added; these semantics are not present in the XML
version given earlier, even though that specification is
much more verbose. The SEMDAL representation is:



    <MetaFrame BookOrderSchema
        ORDER   :: ID SOLD-TO SOLD-ON ITEM+ ;
        SOLD-TO :: PERSON ;
        PERSON  :: LASTNAME FIRSTNAME ;
        ITEM    :: PRICE ENTRY ;
        ENTRY   :: BOOK | RECORD | COFFEE ;
        BOOK    :: TITLE AUTHOR ;
        RECORD  :: RTITLE ARTIST ;
        RTITLE  :: TITLE COMPOSER? ;
        COFFEE  :: SIZE STYLE ;
        *.Format = "Delimited Tagged Named Start End" ;
        *.Required = "Yes" ;
        *.Type = "pcdata" ;
        ORDER.ID.Format = "Attribute WithinTag" ;
        SOLD-ON.Range.lextype = "Date.ISO8061" ;
        SOLD-ON.Range.presence = "fixed" ;
        ITEM.Occurs = "multiple" ;
        COMPOSER.Format.Inherits = "*.Format" ;
        COMPOSER.Format = "Embedded" ;
        COMPOSER.Required = "No" ;
    /MetaFrame>



This MetaFrame indicates that an ORDER consists of an
ID, the last and first names of the Person to whom the order
is sold, the Date sold, and a list of multiple ITEMs, each
of which has a PRICE and is either a Book, a Record, or
Coffee. Alternation "|" on the right hand side of a
substructure expression means exclusive "OR". The ID is a named
attribute inside an XML tag.
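
To make the reading of these productions concrete, the following Python sketch expands a non-terminal into the dot-separated structure paths of its terminals. It is only an illustration: the `PRODUCTIONS` dictionary layout and the function name `leaf_paths` are assumptions, and RECORD is left unexpanded for brevity.

```python
from typing import Dict, List

# Each left hand side maps to a list of alternatives; each alternative is a
# list of right hand side terms, optionally flagged with "*", "+" or "?".
PRODUCTIONS: Dict[str, List[List[str]]] = {
    "ORDER": [["ID", "SOLD-TO", "SOLD-ON", "ITEM+"]],
    "SOLD-TO": [["PERSON"]],
    "PERSON": [["LASTNAME", "FIRSTNAME"]],
    "ITEM": [["PRICE", "ENTRY"]],
    "ENTRY": [["BOOK"], ["RECORD"], ["COFFEE"]],
    "BOOK": [["TITLE", "AUTHOR"]],
    "COFFEE": [["SIZE", "STYLE"]],
}


def leaf_paths(name: str, prefix: str = "") -> List[str]:
    """Return the structure paths of all terminals reachable from `name`."""
    path = f"{prefix}.{name}" if prefix else name
    if name not in PRODUCTIONS:  # no production for it, so it is a terminal here
        return [path]
    paths: List[str] = []
    for alternative in PRODUCTIONS[name]:
        for term in alternative:
            paths.extend(leaf_paths(term.rstrip("*+?"), path))
    return paths


for structure_path in leaf_paths("ORDER"):
    print(structure_path)
```

The occurrence flags are stripped when building a path because a path names a structure component, not how often it occurs.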

Note that the "*.Format" declaration uses a 'wildcard' to
define all Formats as "Delimited Tagged Named Start End".
The semantic attribute declaration COMPOSER.Format="Embedded"
would override that wildcard Format. To include both Format
declarations, COMPOSER.Format.Inherits="*.Format" also needs
to be declared, which means that the Format for COMPOSER
is to inherit the Format defined by the wildcard expression.
The approach to inheritance is to make it explicit and to
allow it to be selective, rather than inheritance being forced,
all or nothing, by standard subclassing mechanisms.
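
One possible reading of the wildcard, override, and explicit-inheritance rules is sketched below in Python. The function name `resolve` is hypothetical, and the way an inherited value is combined with the local one (simple concatenation, inherited part first) is an assumption, since the text does not specify how the two Formats are merged.

```python
from typing import Dict, Optional

ATTRIBUTES: Dict[str, str] = {
    "*.Format": "Delimited Tagged Named Start End",
    "*.Required": "Yes",
    "COMPOSER.Format": "Embedded",
    "COMPOSER.Format.Inherits": "*.Format",
    "COMPOSER.Required": "No",
}


def resolve(path: str, attrs: Dict[str, str] = ATTRIBUTES) -> Optional[str]:
    """Look up a semantic attribute, honouring wildcards and explicit Inherits."""
    _, _, attribute = path.rpartition(".")   # e.g. "Format"
    wildcard = attrs.get(f"*.{attribute}")   # wildcard default, if any
    explicit = attrs.get(path)
    if explicit is None:
        return wildcard                      # nothing local: wildcard applies
    inherited_from = attrs.get(f"{path}.Inherits")
    if inherited_from is None:
        return explicit                      # local value overrides the wildcard
    # Assumed merge rule: inherited value followed by the local value.
    return f"{attrs[inherited_from]} {explicit}"


print(resolve("TITLE.Format"))
print(resolve("COMPOSER.Format"))
print(resolve("COMPOSER.Required"))
```

Inheritance happens only where an Inherits entry is declared, so COMPOSER.Required overrides the wildcard without inheriting anything, which reflects the selective approach described above.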

All other structural information is via explicit tags, which 
name the entry and delimit both the Start and End of the 
entry. Note that "pcdata" just means ASCII data, similar to
"pcdata" in SGML declarations. The Composer information
is optional, and occurs as an embedded (inline) tag within 
the Record Title. 

The DAtabase Integration System (DAISy) of the present
invention has been described in detail, together with how it
provides interoperability across heterogeneous resources,
including databases and other forms of structured data. In
this system, the major differences between diverse data
representations are accommodated by a high level LSD
(Logical Structure Diagram) specification language, and by
use of annotations to factor out of this declarative language
the heterogeneity among different databases and data
structure representations.

When coupled with the emphasis on program generation, 
these annotations enable generation of substantially different 
code for different systems of representation. The data trans- 
formations are expressed via a set of localized coordination 
rules which interrelate components from the source speci- 
fication with components of the target. This approach allows 
the rules to be more declarative in nature, and also supports 
asynchronous processing of the transformations, thereby 
being amenable to parallelization. 
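
The idea of localized coordination rules that can run asynchronously can be sketched in Python as follows. This is not the generated code of the invention: the rule functions, the field names, and `run_bridge` are hypothetical, and a thread pool merely stands in for whatever execution model an actual information bridge would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Rule = Callable[[Dict], Dict]


def rule_customer(source: Dict) -> Dict:
    """Relate two source components to one target component."""
    return {"customer": f'{source["FIRSTNAME"]} {source["LASTNAME"]}'}


def rule_total(source: Dict) -> Dict:
    """Relate a repeated source component to an aggregated target component."""
    return {"total": sum(source["PRICES"])}


RULES: List[Rule] = [rule_customer, rule_total]


def run_bridge(source: Dict, rules: List[Rule] = RULES) -> Dict:
    """Apply every rule independently and merge the partial targets."""
    target: Dict = {}
    with ThreadPoolExecutor() as pool:
        # Each rule reads only the source, so the rules can run concurrently.
        for partial in pool.map(lambda rule: rule(source), rules):
            target.update(partial)
    return target


print(run_bridge({"FIRSTNAME": "Ada", "LASTNAME": "Lovelace", "PRICES": [10, 5]}))
```

Because each rule touches only its own source and target components, the rules do not depend on execution order, which is what makes parallel execution possible in this sketch.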

It is, therefore, apparent that there has been provided in 
accordance with the present invention, a method for pro- 
cessing heterogeneous data that fully satisfies the aims and 
advantages hereinbefore set forth. While this invention has 
been described in conjunction with a specific embodiment 
thereof, it is evident that many alternatives, modifications, 
and variations will be apparent to those skilled in the art. 
Accordingly, it is intended to embrace all such alternatives, 
modifications and variations that fall within the spirit and 
broad scope of the appended claims. 

I claim: 

1. A method for integrating heterogeneous data embodied 
in computer readable media having source data and target 
data comprising: 

providing an interoperability assistant module with speci- 
fications for transforming the source data; 

transforming the source data into a common intermediate 
representation of the data using the specifications; 

transforming the intermediate representation of the data 
into a specialized target representation using the speci- 
fications; 

creating an information bridge with the interoperability 
assistant module through a process of program genera- 
tion; 

processing the source data through the information
bridge; outputting the target data, wherein the target
data is in a non-relational form with respect to the
source data; and

outputting the target data, wherein the target data is in a
non-relational form with respect to the source data,

wherein providing the interoperability assistant with
specifications comprises:

inputting a first high level data structure specification 
which describes the source data representation; 

inputting a second high level data structure specifica- 
tion which describes the target data; 

inputting a high level transformation rule specification; 

processing the first high level data structure with a first 
schema analyzer and a recognizer generator to generate
a source recognizer module;

processing the second high level data structure with a 
second schema analyzer and a view generator to 
generate a target builder module; 

processing the high level transformation rule specifi- 
cation with a transformer generator to generate a 
transformer module; 

parsing the first high level data structure specification 
with the first schema analyzer to create an annotated 
logical structure diagram, the logical structure dia- 
gram serving as a schematic structure graph that 
represents the logical relationships of the source data 
in a context-independent uniform manner; and 

parsing the second high level data structure specifica- 
tion with the second schema analyzer to create an 
annotated logical structure diagram, the first logical 
structure diagram serving as a schematic structure 
graph that represents the logical relationships of the 
source data in a context-independent uniform man- 
ner. 

2. The method as claimed in claim 1, wherein the first 
high level data structure specification and second high level 
data structure specification comprise: 

diverse metadata, including semantic metadata.

3. The method as claimed in claim 1, wherein inputting
the first and second high level data structure specifications 
comprises: 

programming grammar productions; 
programming type descriptions; and 
programming annotation specifications. 

4. The process as claimed in claim 3, wherein program- 
ming grammar productions comprises: 

specifying the logical structure of the heterogeneous data 
using grammar rules to form a uniform representation 
of the data, wherein a right hand side and a left hand 
side of each data statement of the heterogeneous data is 
produced. 

5. The process as claimed in claim 1, wherein parsing 
further comprises: 

forming nodes and edges of the first logical structure 
diagram, each node and edge being logically associated
with a label and an interpretation, the label of a
specific logical structure diagram component corre- 
sponding to a particular application schema component 
and the interpretation of the logical structure diagram 
impacting the meaning of the nodes and edges of the 
logical structure diagram and being derived from the 
annotations in the high level data structure specification 
to enable using the same high level data structure 
specification syntax and the same logical structure 
diagram constructs to represent diverse data models
and application schema. 

6. The process of claim 1, wherein inputting the high level 
transformation rule specification comprises: 
applying transformations to subtrees of the logical struc-
ture diagram schema tree for the source, each transfor- 
mation rule coordinating some input or intermediate 
data objects to other intermediate objects or output 
objects, the collection of the rules forming a data flow 
network which maps the input structure or schema to 
output structure/schema, asynchronous execution of 
the rules in the information bridge carrying out the 
actual data transformations. 

7. The method of claim 1, further comprising: 
providing a first user interface for a first user which
interacts with the combined uniform schema and data
obtained from multiple data sources.

8. The method of claim 7, further comprising: 
providing a second user interface for a second user which
interacts with existing tools and various data
representations.

9. A method for integrating heterogeneous data embodied 
in computer readable media having source data and target 
data comprising: 

providing an interoperability assistant module with speci- 
fications for transforming the source data; 

transforming the source data into a common intermediate 
representation of the data using the specifications; 

transforming the intermediate representation of the data 
into a specialized target representation using the speci- 
fications; 

creating an information bridge with the interoperability 
assistant module through a process of program genera- 
tion; 

processing the source data through the information 
bridge; outputting the target data, wherein the target 
data is in a non-relational form with respect to the 
source data; and 

outputting the target data, wherein the target data is in a 
non-relational form with respect to the source data, 

wherein providing the interoperability assistant with 
specifications comprises: 

inputting a first high level data structure specification 
which describes the source data representation; 

inputting a second high level data structure specifica- 
tion which describes the target data; 

inputting a high level transformation rule specification; 

processing the first high level data structure with a first 
schema analyzer and a recognizer generator to gen- 
erate a source recognizer module; 

processing the second high level data structure with a 
second schema analyzer and a view generator to 
generate a target builder module; 
processing the high level transformation rule specification
with a transformer generator to generate a transformer
module;

parsing the first high level data structure specification 
with the first schema analyzer to create an annotated 
logical structure diagram, the logical structure diagram 
serving as a schematic structure graph that represents 
the logical relationships of the source data in a context- 
independent uniform manner; and 

parsing the second high level data structure specification 
with the second schema analyzer to create an annotated 
logical structure diagram, the first logical structure 
diagram serving as a schematic structure graph that 
represents the logical relationships of the source data in 
a context-independent uniform manner, 

wherein parsing the first and second high level data 
structure specifications includes forming nodes and 
edges of the first logical structure diagram, each node
and edge being logically associated with a label and an 
interpretation, the label of a specific logical structure 
diagram component corresponding to a particular 
application schema component and the interpretation of 
the logical structure diagram impacting the meaning of 
the nodes and edges of the logical structure diagram 
and being derived from the annotations in the high level 
data structure specification to enable using the same 
high level data structure specification syntax and the 
same logical structure diagram constructs to represent 
diverse data models and application schema. 

10. The method as claimed in claim 9, wherein the first 
high level data structure specification and second high level 
data structure specification comprise: diverse metadata, 
including semantic metadata. 

11. The method as claimed in claim 9, wherein inputting 
the first and second high level data structure specifications 
comprises: 

programming grammar productions; 
programming type descriptions; and 
programming annotation specifications. 

12. The process as claimed in claim 11, wherein program- 
ming grammar productions comprises: 

specifying the logical structure of the heterogeneous data 
using grammar rules to form a uniform representation 
of the data, wherein a right hand side and a left hand 
side of each data statement of the heterogeneous data is 
produced. 

13. The method of claim 9, further comprising: 
providing a first user interface for a first user which
interacts with the combined uniform schema and data
obtained from multiple data sources; and
providing a second user interface for a second user which 
interacts with existing tools and various data represen- 
tations. 

14. A method for integrating heterogeneous data embod- 
ied in computer readable media having source data and 
target data comprising: 

providing an interoperability assistant module with speci- 
fications for transforming the source data; 

transforming the source data into a common intermediate 
representation of the data using the specifications; 

transforming the intermediate representation of the data 
into a specialized target representation using the speci- 
fications; 

creating an information bridge with the interoperability 
assistant module through a process of program genera- 
tion; 

processing the source data through the information 
bridge; outputting the target data, wherein the target 
data is in a non-relational form with respect to the 
source data; and 

outputting the target data, wherein the target data is in a 
non-relational form with respect to the source data, 

wherein providing the interoperability assistant with 
specifications comprises: 

inputting a first high level data structure specification 
which describes the source data representation; 






inputting a second high level data structure specifica- 
tion which describes the target data; 

inputting a high level transformation rule specification 
which includes applying transformations to subtrees 
of the logical structure diagram schema tree for the 
source, each transformation rule coordinating some 
input or intermediate data objects to other interme- 
diate objects or output objects, the collection of the 
rules forming a data flow network which maps the 
input structure or schema to output structure/schema, 
asynchronous execution of the rules in the informa- 
tion bridge carrying out the actual data transforma- 
tions; 

processing the first high level data structure with a first 
schema analyzer and a recognizer generator to gen- 
erate a source recognizer module; 

processing the second high level data structure with a 
second schema analyzer and a view generator to 
generate a target builder module; 

processing the high level transformation rule specifi- 
cation with a transformer generator to generate a 
transformer module; 

parsing the first high level data structure specification 
with the first schema analyzer to create an annotated 
logical structure diagram, the logical structure dia- 
gram serving as a schematic structure graph that 
represents the logical relationships of the source data 
in a context-independent uniform manner; and 

parsing the second high level data structure specifica- 
tion with the second schema analyzer to create an 
annotated logical structure diagram, the first logical 
structure diagram serving as a schematic structure 
graph that represents the logical relationships of the 
source data in a context-independent uniform man- 
ner. 

15. The method as claimed in claim 14, wherein the first 
high level data structure specification and second high level 
data structure specification comprise: 

diverse metadata, including semantic metadata. 

16. The method as claimed in claim 14, wherein inputting 
the first and second high level data structure specifications 
comprises: 

programming grammar productions; 
programming type descriptions; and 
programming annotation specifications. 

17. The process as claimed in claim 16, wherein program- 
ming grammar productions comprises: 

specifying the logical structure of the heterogeneous data 
using grammar rules to form a uniform representation 
of the data, wherein a right hand side and a left hand 
side of each data statement of the heterogeneous data is 
produced. 

18. The method of claim 14, further comprising: 
providing a first user interface for a first user which
interacts with the combined uniform schema and data
obtained from multiple data sources; and
providing a second user interface for a second user which 
interacts with existing tools and various data represen- 
tations. 






